Enhanced security computer processor with mentor circuits

ABSTRACT

A computing device includes a plurality of bins distributed in a plurality of frames, and a plurality of mentor circuits. The bins store information for variables. Each mentor circuit may be assigned to a particular one or more of the variables. The mentor circuits perform cache management and operand addressing operations with respect to the particular variables to which the mentor circuit is assigned. A control circuit controls a main program flow.

PRIORITY CLAIM

This application is a continuation of U.S. patent application Ser. No. 16/195,805 entitled “ENHANCED SECURITY COMPUTER PROCESSOR WITH MENTOR CIRCUITS” filed Nov. 19, 2018 (the “'805 application”), which is a continuation of U.S. patent application Ser. No. 14/863,052 entitled “COMPUTER PROCESSOR WITH OPERAND/VARIABLE-MAPPED NAMESPACE” filed Sep. 23, 2015 (the “'052 application”). The entire contents of both the '805 application and the '052 application are incorporated herein by reference as if fully set forth herein.

BACKGROUND

Field

The present invention relates to the field of computer processors and methods of computing. More specifically, embodiments herein relate to computer processor architecture implementing multiple control layers.

Description of the Related Art

At one time, computer performance grew proportionally to transistor density. In the mainframe era, for example, the major limitation to performance using single transistors or MSI/LSI chips was physical size, limited by the number of transistors per cubic foot of mainframe cabinets. The larger the physical size, the slower the cycle time.

In the era of the processor on a chip, yield became the major limiting factor. Along with finer silicon feature sizes, more transistors per die became available at acceptable chip yield, enabling first wider word sizes, then simple pipelining (one instruction per cycle), followed by Instruction Level Parallelism (ILP, up to 4 instructions per cycle). The smaller feature size also enabled higher clock frequency and less power consumption per transistor junction, all of which combined to offer much higher performance as the level of integration increased.

Since about 2005 the picture has changed, as evidenced by the arrival of multi core chips. Instead of getting a four times (or more) faster processor for quadrupling the transistors per die (along with doubling cycle time), as was the case in moving from 16 bit to 32 bit processors, presently when a higher transistor count is used to implement ILP register architecture processors, the design brings diminishing returns on performance (this issue is known in the industry as Pollack's Rule). Thus economics has encouraged the industry to move to multi core in order to take advantage of the available transistor count per die. It is also noted that for FP programs, ILP machines may achieve 0.5 FP per cycle, where the theoretical limit for using one adder and one multiplier is 2.0; to go further, multiple FU copies are thus required. Increasing performance by including multiple copies of functional units may require shadow register structures whose complexity may far exceed the complexity of the systems described herein. Limits to improvements of scalar performance of computer hardware have been characterized as “walls”, including, for example, a power wall, a memory wall, and an instruction level parallelism (“ILP”) wall.

While the approach presented herein overcomes disadvantages both in the “Memory wall” and the “ILP wall”, we will concentrate our presentation on the “ILP wall” effects first. Some effects improving the “Memory wall” issues will also be noted.

The inability of processor architecture to take advantage of the increased transistor count per die to gain performance advantage (the “ILP Wall”) is linked to the register machine namespace interfering in both micro-parallelism and macro-parallelism. Regarding micro-parallelism, having “registers” as part of the processor's namespace serializes processes that are “embarrassingly parallel”.

Present macro-parallelism is limited in some respects due to processors' need to share information and the lack of efficient mechanisms responsible for the integrity of shared variables, an issue addressed herein by the Mentors.

The larger the systems, the larger the memory one wants to access. To access more memory, one has to go off chip. Thus limits on the speed of the clock cycle (power wall) and pin to pin interconnect (memory wall) are at play due to multi chip interconnects and the disparity in cycle times among the different layers of the memory hierarchy (memory wall). Moreover, existing computer architecture may not be expandable to efficiently take advantage of the larger available transistor count. Existing ILP register architecture may be effectively limited by what has been referred to as an “ILP wall”. In register machines the unavailability of operands is mostly caused not by the intrinsic data dependency relations in the source HLL program but by the effects of the register architecture's namespace management. The more one attempts to speed up performance by the use of parallel operations intrinsic in the original HLL algorithm, the more interference in the process is due to the “register” namespace mechanism. Techniques like shadow registers provide some relief but they soon become too complex to provide true solutions.

In the Von Neumann model the namespace is [Memory]+PC; the only “named” entities are operand and instruction addresses in memory and the program counter. There are two basic problems with the original Von Neumann architecture (the “three address machine architecture”, A<=B+C; for example, see SWAC). The first is that the architecture requires four memory access delays per instruction: one memory access for the instruction fetch, two for fetching two data operands and one for storing the result. The second problem is that as memory address size increases, instruction size increases threefold, as each instruction contains three addresses. Typical register architectures reduced the number of memory accesses per instruction to two, one for instruction and one for data, and each instruction contains only a single memory address, keeping instruction size manageable. In RISC machines memory accesses are even fewer, as most instructions do not access memory. However, register architecture significantly complicated the namespace. The namespace in a register machine is: [Memory]+Registers+PC+PSDW+CC (condition codes).
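For illustration only, the access counts and instruction sizes discussed above may be sketched as follows in Python. The function names and the 8 bit opcode and 4 bit register field widths are assumptions made for this sketch and are not taken from any particular machine.

```python
def three_address_cost(addr_bits, opcode_bits=8):
    """Memory accesses and instruction width for a three address instruction.

    opcode_bits is an assumed width chosen only for this illustration.
    """
    # One instruction fetch, two operand reads, one result write.
    accesses = 1 + 2 + 1
    # Each instruction carries three full memory addresses.
    instr_bits = opcode_bits + 3 * addr_bits
    return accesses, instr_bits


def register_cost(addr_bits, opcode_bits=8, reg_bits=4):
    """Same figures for a register instruction with a single memory address.

    reg_bits is likewise an assumed register field width.
    """
    # One instruction fetch and one data access.
    accesses = 1 + 1
    instr_bits = opcode_bits + reg_bits + addr_bits
    return accesses, instr_bits


# Widening addresses from 16 to 32 bits adds 3 * 16 bits to the three address
# form but only 16 bits to the single address register form.
growth_three_address = three_address_cost(32)[1] - three_address_cost(16)[1]
growth_register = register_cost(32)[1] - register_cost(16)[1]
```

Under these assumed field widths, the address portion of the three address instruction grows three times as fast as that of the register instruction, which is the instruction size effect noted above.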

The introduction of vector registers improved performance in programs that exhibited micro parallelism. However, in the long run vector registers further complicated the namespace and the software mobility issues. The namespace in vector machines is: [Memory]+PC+Registers+Vector Registers+PSDW+CC. The namespace mechanism, for register architectures and registers+vector architectures, is a major factor in the creation of the “ILP Wall”. Once cache is introduced into the picture, cache solves the same operand access delays (staging) problem that registers and vector registers originally solved. Operands can be used within one or two cycles in a one-operand-per-cycle stream from either the cache, the registers or the vector registers. From that point on, registers and vector registers may further complicate the namespace and coherency issues. Coherency may be lost, for example, when the program changes operand values already staged in cache, a register or a vector register. Therefore, once one introduces caches into the architecture, the real advantage of register architecture over Von Neumann architecture is in smaller instruction size and thus possibly smaller program size, advantages that can be overcome by namespace mapping methods.

For historical and other reasons, both computer machine languages and HLLs do not include the concept and semantics of “plural form” as part of the language for expressing algorithms. For insight into where HLLs did propose extensions (FORTRAN) that do recognize this subject, please see “FORALL in Parallel” and “FORALL In Synch” in Modula 2.

In simple algorithms (some micro-parallelism type codes) the existence of parallelism may be deduced by the compiler from the “N times singular” form of the DO or FOR loops.

For insight, “Company about face” is linguistically a plural language form of an instruction in English, while “DO I=1, N; Soldier (I) about face; END DO” is an “N times singular” linguistic form. A characteristic effect of the use of the “N times singular” form is that it typically transforms a parallel process into a serial process.
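The distinction may be sketched, for illustration only, in Python. The names about_face, company_about_face_singular and company_about_face_plural are hypothetical, and a soldier is modeled here simply as a heading in degrees. Python itself evaluates both forms sequentially; the point of the sketch is only that the first form states an iteration order while the second states a single operation on the whole collection.

```python
def about_face(heading_degrees):
    """Turn one soldier around: rotate the heading by 180 degrees."""
    return (heading_degrees + 180) % 360


def company_about_face_singular(company):
    """'N times singular' form: an explicit, ordered loop over soldiers."""
    result = list(company)
    for i in range(len(result)):
        result[i] = about_face(result[i])
    return result


def company_about_face_plural(company):
    """'Plural' form: one operation applied to the company as a whole.

    No iteration order is stated, so an implementation would be free to
    process all elements at once.
    """
    return list(map(about_face, company))


company = [0, 90, 270]
singular_result = company_about_face_singular(company)
plural_result = company_about_face_plural(company)
```

Both forms yield the same headings; only the first commits the program text to a serial order.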

The lack of the explicit “plural form” in both machine languages and most HLLs blocks (1) having dialogs between programmer and compiler regarding the parallel properties of the algorithm as well as (2) addressing parallel properties of complex codes (midlevel and macro parallelism) whose parallel properties cannot be deduced by the compiler but need to be explicitly implemented by the programmer. Presently parallel operations may be done, for example, by assigning parallel tasks to different code threads; see CC++ PARAFOR, where each iteration of a “PARAFOR” creates a new thread which executes in parallel with all other iteration bodies. Existing ILP register machines and their predecessors may either convert micro-parallel actions into thread structures, appropriate for macro parallel operations but cumbersome for micro parallelism as is the case of PARAFOR, or the compiling process may remove micro parallelism information and convert the information into a strictly singular (sequential) machine language form. In the case of Vector and VLIW machines, the parallelism information is strictly used in the compiler to directly control very specific vector or VLIW hardware structure(s). Those structures may be a good fit for processing micro parallel applications, but they also may produce clumsy code that is hard to debug and very hard to transport.

SUMMARY

Computing devices and methods of computing are described. Computing devices may, in various embodiments, include a processor (e.g., CPU) and computer memory. In an embodiment, a computing device with multi-layer control (mentor layer and instruction/control layer) includes a memory and one or more functional units. The computing device may include a processor configured to implement a multi-layer control structure including a data structure layer including a local high speed memory, a mentor layer, and an instruction/control layer. The local high speed memory includes one or more variables. The mentor layer includes one or more mentor circuits. The mentor circuits control actions associated with the variables in the local high speed memory and associated other cache(s), main memory(ies), communication channel(s) or instrumentation device(s). The instruction/control layer includes one or more circuits that interpret instructions or control operations by one or more functional units. In some embodiments, the local high speed memory implements a frame/bins structure.

In an embodiment, a method of computing with multi-layer control (mentor and instruction interpretation/controls) includes managing, by a mentor circuit in a processor, one or more variables in a local high speed memory (and other associated data locations); and performing, by an instruction interpretation/control circuit, one or more instructions or control of one or more operations by one or more functional units of the processor.

In an embodiment, a computing device includes a main memory, a local high speed memory, one or more functional units, and one or more interconnects. The local high speed memory implements a frame/bins structure. The local high speed memory includes a plurality of frames, each frame including a physical memory element. Bins are distributed in the frames. Each bin includes a logical element. Functional units perform operations relating to one or more variables stored in the bins.

In an embodiment, a computing device includes a main memory; a local high speed memory comprising one or more bins; one or more functional units; one or more interconnects between the main memory and the local high speed memory; one or more interconnects between the local high speed memory and the one or more functional units; and one or more mentor circuits. Each of the bins stores a Variable. The functional units perform operations relating to Variables stored in the local high speed memory. The mentor circuits control operations relating to at least one Variable stored in at least one of the bins. In one embodiment, a method of computing includes managing, by a mentor circuit in a computing device, one or more Variables contained in one or more bins of a local high speed memory; and performing, by the computing device, one or more instructions or control of one or more operations using one or more of the Variables managed by the mentor circuit.

In an embodiment, a computing device includes a main memory; a local high speed memory; one or more functional units; one or more interconnects between the main memory and the local high speed memory; and one or more interconnects between the local high speed memory and the one or more functional units. The local high speed memory implements a frames/bins structure. The local high speed memory includes a plurality of frames, each of at least two of the frames comprising a physical memory element; and a plurality of bins distributed in the plurality of frames. Each of the bins includes a logical element. The functional units perform operations relating to Variables stored in the bins, each of the Variables including one or more words.

In an embodiment, a computing device includes a memory structure storing one or more Variables; and a logical mentor. The logical mentor is assigned to at least one of the one or more Variables and performs addressing operations with respect to the Variables to which it is assigned. In an embodiment, a method of computing includes storing one or more Variables in the memory of a computing device; assigning a logical mentor to the Variables; and performing, by the logical mentor, addressing operations with respect to the Variables.

In an embodiment, a computing device includes a memory storing one or more Variables and information relating to the singular/plural nature of at least one variable and/or algorithm, and one or more functional units (Language Unit). The functional units receive the singular/plural information and perform one or more operations using at least one of the Variables using the singular/plural information. In an embodiment, a method of computing with plural information includes storing, in a memory, one or more Variables; storing, in a memory, information relating to the singular/plural nature of at least one algorithm; receiving at least a portion of the singular/plural information; and performing, using the singular/plural information, one or more operations using at least one of the Variables. In one embodiment, a method of computing includes linguistically implementing, by one or more circuits, plural-form instructions comprising one or more threads. Each thread may be a set of one or more programs. Each thread may be associated with one or more Variables such that the thread can be assigned plural and robustness properties relating to its interaction discipline(s) with other threads.

In an embodiment, a computer processor includes an operands-mapped namespace and/or a Variables mapped namespace. In some embodiments, a system for performing computing operations includes a processor comprising a namespace; and one or more memory devices physically or logically connected to the processor, wherein the memory devices comprise memory space. The namespace of the processor is not limited to the memory space of the one or more memory devices. In an embodiment, a method of computing includes physically or logically connecting a processor to one or more memory devices comprising memory space, and implementing, by the processor, a namespace, in which the namespace is not limited to the memory space to which the processor is physically or logically connected.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates one embodiment of a computing device implementing a mentor layer.

FIG. 2 illustrates memory bandwidth requirement reduction using processor teams.

FIG. 3 illustrates data structure elements in one embodiment.

FIG. 4 is a diagram illustrating a frames/bin structure.

FIG. 5 is a diagram illustrating a word arrangement in blocks and frames.

FIG. 6 is a diagram illustrating a data structure with frame/bins in a high speed local memory.

FIG. 7 is a diagram illustrating a functional unit and associated tarmac registers.

FIG. 8 is a diagram illustrating crossbar notations in one embodiment.

FIG. 9 is a diagram illustrating one embodiment of a bins/frames interconnect to and from main memory.

FIG. 10 is a diagram illustrating a crossbar interconnect from frames to functional units.

FIG. 11 is a diagram illustrating functional units to frames/bin interconnects.

FIG. 12 illustrates a memory addressing circuit.

FIG. 13 illustrates a functional block diagram of a processor including a dynamic VLIW program flow control level and a mentor circuit control level.

FIG. 14 is a functional diagram of a mentor circuit in one embodiment.

FIG. 15 is a functional and interconnect block diagram of a mentor circuit.

FIG. 16 illustrates a mentor/bin to functional unit command transfer format.

FIG. 17 illustrates a virtual mentor file (VMF) format for a dimensioned element (array).

FIG. 18 illustrates a virtual mentor file (VMF) for a mentor holding single variables and constants.

FIG. 19 is a block diagram illustrating dynamic VLIW control.

FIG. 20 illustrates a VLIW instruction format with sequence control.

FIG. 21 illustrates data structure control of VLIW type 0000.

FIG. 22 is a block diagram for an implementation of a Dynamic VLIW instruction issue circuit.

FIG. 23 illustrates an example of a DONA indexing formula.

FIG. 24 illustrates an example of a DONA main algorithm code.

FIG. 25 illustrates a content-addressable memory functional unit.

FIG. 26 is an operational flow diagram of a simple relaxation algorithm using array processing with multiple ADD and MPY functional units.

FIG. 27 illustrates an example of work flow in synchronized hardware and software development based on the use of C++ as software migration base.

While the invention is described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that the invention is not limited to the embodiments or drawings described. The emphasis in the examples is to show scope of the architecture, not to present preferred implementation(s). It should be understood that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to.

DETAILED DESCRIPTION OF EMBODIMENTS

Architecture

In various embodiments, a variation/enhancement on Von Neumann architecture expands the architecture's namespace in two ways. The first way is by providing compatibility with software's and HLL “conceptual units”, and the second way is by including in the namespace not only memory but also any and all other relevant data interconnects. In a classical Von Neumann architecture, both program and data are in a single memory that is the processor's namespace. In embodiments described herein, the Von Neumann namespace definition is expanded by hardware mechanisms mapping software's “conceptual units”, the “Variables”, to form the full scope of the computer's data connectivity through the Variables not only to memory but also to I/O, communications, and instrumentation data access, all this while freeing one from the need for hardware specific means (Registers, Stacks, Vector registers), means that can, in some cases, limit performance and create idiosyncratic namespace distortions. Embodiments as described herein may implement a processor architecture framework that inherently takes advantage of the multiple copies+spare(s) technology.

This disclosure shows what the device in the embodiments is doing and then shows how it is doing it.

First, with respect to what the device is doing:

Consider the statement by John Backus in the DARPA 2008 supercomputersreport page 62:

“Surely there must be a less primitive way of making big changes in the store than by pushing vast numbers of words back and forth through the von Neumann bottleneck. Not only is this tube a literal bottleneck for the data traffic of a problem, but, more importantly, it is an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand. Thus programming is basically planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself but where to find it.”

As noted by Backus, a significant limitation of present processors is due to the fact that while software, communications, instrumentation and systems are designed in terms of “conceptual units” (Variables) consisting of single variables, arrays, lists, communication frames, queues, and more complex structures, processors deal with single words in memory as data items or individual instructions.

This conceptual discontinuity between the processor proper and the system's “conceptual units” is a key to understanding the causes of many limitations of present ILP processors, including limits to processor performance (ILP Wall), low robustness, security issues, excessive system complexity and the associated drop in performance due to this complexity, communications and instrumentation elements not being part of the processor model, as well as many other limitations.

Using the architecture as described in some embodiments of the present disclosure, the software program deals with “conceptual units” (Variables). The hardware takes the responsibility of mapping “Variables” to physical words and physical gating and Functional Units (FUs). The assumption of the “Variables” management responsibility by the hardware and the integration of robustness capabilities like bound checks in the hardware lead to advantages in performance, software robustness, multiprocessor work teams, system robustness, system simplicity, and the application of the basic processor model not only to data in memory but also to communications and instrumentation elements (communications and instrumentation Variables are noted as “Infinite Variables” in contrast to “finite Variables” residing in memory). The above discussion addresses, in general terms, “what” the device is doing.

With respect to how devices as described herein perform their functions, the devices introduce, in various embodiments, two new elements into processor architecture.

The first element is the Frames/Bins element that replaces the Registers, Condition Codes, L1 data cache, instruction cache, shadow and vector registers of conventional design.

The second element is the Mentor layer. The Mentor layer is a new control layer positioned between the instruction interpretation control layer and the data structure. The Mentor layer is a set of one or more Mentor circuits that receives control activities in terms of logical Variables and translates them to control activities in terms of physical devices and physical data locations in the Bins and memory.

The following includes an overview of the hardware embodiments, including discussion of several advantages of the devices. Many of the advantages not explored herein may be revealed in specific implementation details such as silicon technology used (logic design technology, PIM technology), spares strategy, type of package and number of pins, etc.

In some embodiments, the processor architecture accommodates communications, inter-processor links, instrumentation, and I/O that need to handle “infinite variables” representing access to data elements outside of memory.

In some embodiments, mentors of shared variables address limitations in macro-parallelism by providing an efficient mechanism responsible for the integrity of shared variables.

In some embodiments, the architecture model is based on a namespace definition and the two-layered access to operand structure that the definition entails.

The namespace may be: Variables+PC (Program Counter).

In this embodiment, by placing the PC of the Von Neumann architecture in a Memory location (for example address location “0”), the Von Neumann namespace is just [Memory].

The corresponding assertion, when applied to this embodiment, holds that by defining a specific Variable as the PC, the computing element's namespace is just Variables.

In this case, the only names that may be seen by the application algorithm are the Variable names; this includes the program files, as those files are also Variables. Instruction interpretation in the first architecture level operates upon Variables. Variables may be arrays, code segments, lists, individual words, or data elements outside Memory accessed through communications or instrumentation interfaces.
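As a purely illustrative software analogy, and not a description of the hardware, a namespace consisting only of Variables may be sketched in Python as a mapping from Variable names to values, in which the program and the PC are themselves entries. The names make_namespace and fetch_next, and the instruction strings, are hypothetical.

```python
def make_namespace():
    """Build a toy namespace in which every named entity is a Variable."""
    return {
        "PC": 0,                                    # the PC is just another Variable
        "PROGRAM": ["LOAD", "ADD", "STORE", "HALT"],  # the program file is a Variable
        "A": [1.0, 2.0, 3.0],                       # an array Variable
        "COUNT": 3,                                 # a single word Variable
    }


def fetch_next(namespace):
    """Fetch the next instruction using only Variable names."""
    instruction = namespace["PROGRAM"][namespace["PC"]]
    namespace["PC"] += 1
    return instruction


ns = make_namespace()
first_instruction = fetch_next(ns)
second_instruction = fetch_next(ns)
```

In this sketch nothing outside the mapping is addressed by name, which mirrors the statement that the namespace is just Variables.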

The reader may recognize that our definition of “Variables” is similar in definition to the term “Object” in “object oriented” systems such as C++. Data is not just a segment of words in Memory but a combination of the segment of words with a set of properties and rights. The properties and rights may be explicitly defined by an accompanying descriptor or may be deduced by the system from the data segment itself and other system information.

Therefore, for insight one may view the computing element's namespace as a namespace of “objects” and the described processors as an implementation of an object oriented computing device. However, as will be further discussed, some of the elements, such as Infinite Variables and the Variable Mentor File (VMF), are not part of present models of object oriented systems such as C++.

Furthermore, the computing device described herein is efficient in micro parallelism, which is very inefficient in systems that assign threads to micro parallel actions (see C++ FORALL).

Thus while Variables and Objects may look similar, the definition of object oriented architecture in C++ includes specifics about: object, class, abstraction, encapsulation, inheritance, polymorphism, etc. The implementation of Variables in this computing element (1) has Variable types (classes) that are not included in the C++ or similar object oriented sets (Infinite Variable, VMF) and (2) the computing element described herein may choose similar definitions for Variables as in an existing object oriented system, for compatibility and software transport reasons, or it may choose a totally different set of characterizations, rights and properties in the computing element's language and Variables, or a combination of the two approaches. Therefore for the rest of the document the term Variables is used.

The second control level in the architecture may be referred to herein as the mentor level. The mentor level is responsible for all operand access (operand read and operand write). According to the Variable's definition, the operands may be in Memory, or they may be outside Memory in communications or instrumentation environments. Thus the “namespace” scope of the architecture has been increased from “main memory” to anything that is named as a Variable.

FIG. 1 (see also FIG. 13) is a block diagram illustrating a dual control-level architecture in one embodiment. The top level is similar to a processor control circuit containing basically two parts, the instruction decoding part and the instruction issue part.

However, in various embodiments described herein, the instruction issue is generated in terms of Variables and not in terms of end operand activity. It is up to the Mentor level to translate Variable addressing to operand addressing. The operands may be in [Memory] or they may be acquired from or sent to communication, I/O, processor-to-processor links, or instrumentation elements supervised by their respective Mentors.

The namespace architecture may be upward compatible with a Von Neumann architecture namespace definition. All Variables in [Memory]+PC are accessed through the Variables+PC namespace. However, the Variables+PC namespace may also access data that is not in [Memory]. While connectivity to data outside Memory is standard in most systems, this data transfer work is done outside the scope of the basic architecture model. In contrast, the architecture model in embodiments described herein may include all of the information accessed by the computer system.

Like some present processor ILP control structures, a processor in the implementations described herein may interpret multiple instructions per cycle and issue control signals to the underlying data structure in order to execute the instructions. However, the operand namespace addressed by the top level control is not in terms of registers and memory addresses but in terms of Variable names, where a Variable may be a single word, array, list, queue, program file, I/O file, etc.

The second control layer is built from a set of Mentor circuits. The Mentor circuits are responsible for mapping the Variables' (i.e., “conceptual units”) IDs (Variable names) to the appropriate Variable operands (specific array words) in order to present the operands to the arithmetic and logic functional units. The Mentors know (e.g., maintain information on) the Variable's type (word, byte, etc.), memory location and dimensions and may be responsible for the Variable's cache management and coherency issues.
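The mapping from a Variable name and index to a specific operand may be sketched in software, for illustration only, as follows. This is a minimal Python analogy under stated assumptions and not the circuit described herein: the class name MentorSketch, its fields, the byte addressing scheme and the example base address are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MentorSketch:
    """Hypothetical software stand-in for a Mentor assigned to one array Variable."""
    name: str          # Variable name seen by the instruction layer
    base_address: int  # assumed byte address of element 0 in memory
    word_bytes: int    # assumed size of one element in bytes
    length: int        # number of elements (the Variable's dimension)

    def operand_address(self, index: int) -> int:
        """Translate (Variable, index) into a physical byte address."""
        if not 0 <= index < self.length:
            # Bounds check performed before any memory access is attempted.
            raise IndexError(f"{self.name}({index}) is outside 0..{self.length - 1}")
        return self.base_address + index * self.word_bytes


# The instruction layer refers only to "A" and an index; the sketch resolves
# the physical location from the type, location and dimension it holds.
a_mentor = MentorSketch(name="A", base_address=0x1000, word_bytes=8, length=4)
address_of_a3 = a_mentor.operand_address(3)  # 0x1000 + 3 * 8
```

Because the dimension is held alongside the location, an out of range index is rejected before an address is formed, which corresponds to the bounds checking role attributed to the Mentors.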

In register architecture, the compiler may receive from the HLL program information regarding the “conceptual units” in terms of Variables' properties. However, the information is not typically transferred to the hardware as the hardware does not have means to understand it. The Mentor structure does understand this information and can take advantage of it. The inclusion of “Registers” in the processor's namespace impedes the hardware from taking advantage of the algorithm's parallelism in two ways. First, as stated above, the information about parallelism is not available to the processor; this issue will be discussed later under “plural form”.

The second issue is that the use of “Registers” in the instruction set turns parallel processes into serial processes due to the instruction set requiring that operands be staged through a register (scalar or vector) on the way from memory to a FU and again on the way from the FU to memory. This staging, while it significantly reduced memory traffic when originally introduced, presently may pose traffic bottlenecks.

The following example details the problem:

BEGIN ADD-ARRAYS; DO P;

D(I)=A(I)+B(I)+C(I);

END DO; END ADD-ARRAYS;

A RISC “register” machine language equivalent of ADD-ARRAYS is:

    R7<=Mem [A array base]
    R8<=Mem [B array base]
    R9<=Mem [C array base]
    R10<=Mem [D array base]
    R11<=Lit “1” Comment: The value 1 using the literal field.
L1  R12<=Mem [R7, R11] Comment: R7+R11, R7 base, R11 index.
    R13<=Mem [R8, R11]
    R13<=R13+R12
    R12<=Mem [R9, R11]
    R13<=R13+R12
    Mem [R10, R11]<=R13
    R11<=R11+Lit “1”
L2  R12<=R11−Lit “P” Comment: R12 register reused in branch test
    BRANCH-ON-NEGATIVE TO “L1”

The potentially highly parallel HLL algorithm has been converted to a sequential process in the Register machine language code. All elements of arrays A and C are passed through a single register R12. Thus R12 has multiple uses in the machine language code. First it is used for the transport of array A and C operands. R12 is later used for Condition Code checking for loop termination by line L2. This practice is known as “register reuse”.

In compilers optimized for ILP register machines, one attempts to avoid some of the serializing processes. Serializing effects such as register reuse may be remedied by the compiler, for example by using a different (unused) register rather than R12 for array “C” and converting L2 to use yet a different register:

L2 R14<=R11−Lit “P”

However, those corrections do not remedy the basic problem, which is that operands of an array in a Register machine must serially pass through the same “Register”.

An analysis of the HLL source will show that the computation of all the D(I)=A(I)+B(I)+C(I) statements may be done in any order, including doing all P iterations simultaneously. There are no operand dependencies among the A, B and C operands and the D array results. However, once the code is compiled to “Register” based machine language, all “A” operands must progress serially through a single register (R12), all “B” operands must progress serially through R13, etc. It is not the mere existence of “Registers” in the Register machine namespace that is of concern; it is the fact that operand traffic needs to go through those Registers in sequential order on the way to and from the arithmetic and logic units, an issue which engenders traffic flow problems when micro-parallelism is considered.
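The order independence of the ADD-ARRAYS loop can be illustrated with a minimal sketch. Python is used here only as notation; the function name `add_arrays` and the sample data are hypothetical and are not part of the described embodiments. The sketch evaluates the same P statements once in index order and once in a shuffled order, and the results agree because no iteration reads a value written by another.

```python
import random


def add_arrays(a, b, c, order):
    """Compute d[i] = a[i] + b[i] + c[i], visiting indices in the given order."""
    d = [None] * len(a)
    for i in order:
        # Each iteration reads only a[i], b[i], c[i] and writes only d[i],
        # so no iteration depends on the result of any other iteration.
        d[i] = a[i] + b[i] + c[i]
    return d


P = 8  # illustrative loop count (assumption)
A = list(range(P))
B = [10 * x for x in range(P)]
C = [100 * x for x in range(P)]

# Index order, as a sequential DO loop would visit the elements.
sequential = add_arrays(A, B, C, range(P))

# An arbitrary permutation of the same indices.
shuffled_order = list(range(P))
random.Random(0).shuffle(shuffled_order)
shuffled = add_arrays(A, B, C, shuffled_order)
```

Both `sequential` and `shuffled` hold the same values, which is the property a register machine encoding hides when every operand is forced through R12 and R13 in turn.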

The Mentors may manage individual cache sectors assigned to their Variables. One of the Mentors may, for example, be assigned to the program; this Mentor may then manage the program cache, while the rest may manage their data caches. The Mentors of Variables assigned to I/O, communications or instrumentation will manage the appropriate protocols and assigned cache. In addition to managing the cache for each Variable, the Mentors may contain bounds checks and other mechanisms that enhance both security and program debug features, protecting Variables' integrity and assisting in program debug.

This approach to handling arrays may differ from vector processing, where the compiler discards the array's original information by transforming the HLL's array information into one-word scalar and 64-word vector registers, terms that the machine language understands.

Both vector structures and the Mentors described herein add complex hardware structure to the basic Von Neumann machine. The difference is that vector processing requires compiler involvement in internal hardware details, while the addition of Mentors gives the processor (hardware) an understanding of the nature of the program's “conceptual units”, and as such software tasks and portability are simplified.

In terms of the OSI reference model, the architecture of embodiments described herein moves the hardware/software interface upward toward the HLL application layer. The Mentor layer may manage the Variables' caches in ways that (1) provide continual operand streams, enabling array (vector type) operations without the unwanted artifacts of either vector or scalar registers or of breaking a DO loop into 64 word “chunks”; (2) include automatic bounds and other checks to protect the Variables' integrity for security and debug support; and/or (3) enable including in the model Variables that do not reside in part or whole in Memory, such as Variables for communications, instrumentation, etc.

Use of Variables

In some embodiments, the top layer, which contains instruction interpretation and control, deals strictly with logical Variables, while the Mentors are responsible for mapping the logical Variable namespace to the physical memory address space.

In some embodiments, the HLL DO statement is made more effective by deploying “plural” concepts of the ALL and IMMUTABLE and other additions to both the HLL(s) and to the machine language OP Codes.

Plural Forms for Computer Hardware and Software Languages

As used herein, “plural forms” may define parallel properties of algorithms independent of the specific means or of the amount of parallelism actually deployed in any particular hardware and/or software implementation. Plural forms may be applied, in various embodiments, in the context of HLL or machine languages. To facilitate parallel processing and portability, (1) the parallel properties of codes (algorithms) should preferably be made very clear and (2) the parallel properties of code should preferably be stated in a form that is independent of the specific means or the amount of parallelism deployed in any particular software and/or hardware implementation.

In some embodiments, a processor directly accepts information regarding the singular/plural nature of an algorithm. A hardware/software interface may transfer the “plural form” of information regarding the algorithm in the machine language or by other means. Information such as DO, ALL, IMMUTABLE, BRANCH-OUTSIDE-THE-LOOP-(OR-CODE)-SEGMENT, END, etc. may be provided in order to take advantage of the parallel nature of the algorithm and in order to avoid the “outside of the address space reach” that is associated with the use of conditional branches in machines that deploy speculative execution. In addition, “plural form” information, as demonstrated by the “simple relaxation algorithm”, may serve to improve the accuracy of algorithms when modeling naturally parallel processes.

In some embodiments, “plural form” is included as part of expressing algorithms in HLL. In some embodiments, “plural form” is included as part of expressing algorithms in machine language.

The information regarding plural or singular may allow for separation between stating the parallel properties of the algorithm, which is machine independent information, and the mechanisms of doing the task, which should be handled by the compiler through the OS and the machine language in order to use the parallelism in the algorithm and in the processor for performance, robustness or other considerations.

In some embodiments, the OP codes “DO” and “ALL” are used in loop control (instead of using, for example, JUMP and BRANCH). There may be certain advantages to doing so. This information enables the hardware to engage in effective streaming operations, i.e. “vector type processing”, without the need to resort to “vector registers” or vector instructions, and also enables the design to perform bounds checks for both fetch and store operations.

BRANCH instructions typically use speculative branch prediction to speed up execution. However, in the loop control process DO and ALL may replace BRANCH. When using BRANCH (and branch prediction), the program flow may overreach at the last loop iteration(s), addressing operand(s) in memory which are typically in the zone belonging to an element placed in memory next to the array. Thus, toward the loop end, the “speculative” memory addresses may reach outside array bounds. In some cases, speculative look-ahead associated with BRANCH may be an obstacle to performance (the last operations need to be undone upon misprediction) and may cause difficulties in implementing automatic array bounds checks, since the instruction execution mechanism does out of bound reads as a matter of course during speculative execution.

Instead of just having Conditional-Branches in the instruction repertoire, having “DO” as well as “ALL” instructions, which contain the Index parameters, as well as having all array parameters available to the hardware, enables the architecture to make sure that DO (or ALL) loops never fetch operands beyond the Variables' range and still operate at maximum performance.
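As a minimal sketch of this idea, assume the loop's index parameters and the array's bounds are both known before the first operand is fetched. The whole index range can then be checked once, up front, instead of discovering an out-of-range address during execution. The function names `loop_within_bounds` and `stream_operands`, and the 1-based `lower` bound convention, are hypothetical choices for illustration only.

```python
def loop_within_bounds(first, last, lower, upper):
    """Return True if every index in first..last lies inside lower..upper (inclusive)."""
    if last < first:
        # An empty loop touches no operands, so it cannot violate the bounds.
        return True
    return lower <= first and last <= upper


def stream_operands(array, lower, first, last):
    """Fetch array elements for indices first..last after checking the whole range.

    `lower` is the logical index of array[0] (1 for the HLL examples in the text).
    """
    upper = lower + len(array) - 1
    if not loop_within_bounds(first, last, lower, upper):
        # The check uses only loop parameters and array parameters,
        # so it is made before any operand address is formed.
        raise IndexError(
            f"loop range {first}..{last} exceeds Variable range {lower}..{upper}"
        )
    return [array[i - lower] for i in range(first, last + 1)]


inside = stream_operands([5, 6, 7, 8], lower=1, first=2, last=3)
```

A loop whose range runs past the end of the array, for example `first=3, last=5` on the four-element array above, is rejected as a whole rather than after a speculative out-of-bounds read.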

Thus, while pursuing the direction of providing the hardware the most relevant information for effective program processing by using powerful instruction sets, methods such as those described herein may provide better information linguistically through added HLL linguistic concepts. Some examples are provided herein for HLL DO commands (ALL, IMMUTABLE code). The addition of this information to HLL and to the code may promote higher performance, bounds protection and better code debugging assists.

Stated differently, in addition to the three “walls” (the power wall, memory wall and ILP wall), there are limiting factors to processing capabilities due to the fact that important information regarding properties of arrays and other Variable type conceptual units is not stated in present register machine codes.

Present HLL software uses DO for both sequential and plural operations. One can see the true sequential reason in implementing a Fibonacci sequence:

A(1)<=3; A(2)<=6; DO i=3, 50; A(i)<=A(i−1)+A(i−2); END DO;

In the Fibonacci example above there is a truly sequential relation among operands. One cannot compute A(i) prior to A(i−1) and A(i−2) being present.
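The sequential dependence can be made concrete with a short sketch. It is illustrative only; the function name `recurrence` and the convention of returning `None` when an operand is missing are assumptions made here, not part of the described embodiments. The sketch uses the same starting values as the example (3 and 6) and attempts the recurrence in two different orders.

```python
def recurrence(order, n=10):
    """Evaluate A(i) = A(i-1) + A(i-2) for the indices in `order`.

    Returns the list A(1)..A(n), or None if some A(i) is requested
    before both of its input operands are present.
    """
    a = [None] * (n + 1)  # index 0 is unused so that indices match the 1-based text
    a[1], a[2] = 3, 6
    for i in order:
        if a[i - 1] is None or a[i - 2] is None:
            return None  # an input operand has not been computed yet
        a[i] = a[i - 1] + a[i - 2]
    return a[1:]


forward = recurrence(range(3, 11))        # i = 3, 4, ..., 10
backward = recurrence(range(10, 2, -1))   # i = 10, 9, ..., 3
```

The forward order completes, while the backward order fails at its first step because A(9) and A(8) are not yet present. This is the property that distinguishes a truly sequential DO from a plural operation.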

However, in most present HLL linguistics, the commands DO or FOR are also used in plural operations (with PARAFOR and FORALL being noted exceptions).

For explanation, comparing this to military commands in English, it is as if one did not have the words to say “company, about face”; one could only say “soldier I to N, DO an about face”, where it is unclear whether they should “DO the about face” one after the other (wave), all in parallel, or in combinations. Consider the following:

DO i=1, N; A(i)<=B(i)+C(i); END DO;

Here DO (or FOR) is the way one typically states in HLL the requirement to perform ALL the above N additions. The ALL HLL word and corresponding ALL OP Code state that one may perform the operations all in parallel or in any parallel/sequential order one may select.

In introducing ALL and the corresponding ALL OP Code as in

ALL i=1, N; A(i)<=B(i)+C(i); END ALL;

we have removed the ambiguity between sequential operations and plural operations.

The removal of this ambiguity allows the hardware to best deploy its resources for maximizing performance and for bounds protection.

The use of IMMUTABLE further expands the ALL concept to assist during program design where data “version control” is required, in order to use plural Variable properties for better program design and to provide more accurate results (see Simple Relaxation example).

The ALL and IMMUTABLE HLL and machine language capabilities are important but are not a prerequisite to the architecture's use of the Variables and associated Bin and Mentor implementation. Present register architectures are limited in their abilities to deploy parallel activities, specifically of the class described as micro parallelism; therefore the issue of expressing the parallel capabilities in HLL has been marginalized. Micro parallel activity may be best handled by cooperation of software and hardware, which requires upward linguistic changes so that HLL and machine languages include the plural properties of both Variables and algorithms.

As a further illustration, consider the thread linguistic access property ORDERED-QUEUE for multiple requestors accessing multiple resource servers.

ORDERED-QUEUE may be implemented in several ways: by a QUEUE-CLERK, where a requestor is assigned by the QUEUE-CLERK to a particular server and is put on a queue there (SS office method); by the requestor judging queue length and selecting a queue (supermarket method); by the requestor tearing off a tab, thus receiving a queue number, and waiting until the number is called (DPS method); etc.

The thread access discipline is linguistically ORDERED-QUEUE; how to implement ORDERED-QUEUE is an implementation decision.

Defining powerful OP codes for handling “powerful conceptual operations” involves the subject of array and loop indices. In some embodiments, loop indices are considered as parameters of “DO” (or FOR) instructions rather than Variables (in a similar way that the number of bytes is a parameter in a “byte stream move” instruction). Indices and Variables may nonetheless exchange information.

While some namespace architecture approaches may accommodate additional extensions, as in the case of “Memory+PC+Registers+PSDW” and “Memory+PC+Registers+Vector Registers+PSDW”, in most instances this is not the case. Some previous attempts to contain both Registers and STACK in the same architecture failed as the “Registers” became dominant. Similarly, it may not be useful to extend the namespace beyond “Variables”; based on past experience, other types of namespace extensions will be marginalized.

The architecture as defined herein, however, does not require excluding hardware defined elements from the namespace or the architecture; it only requires that the Mentor handling the element understands what the element is and handles it as a Variable, to wit, a Mentor handling a gigabit Internet channel.

Restating the register in the namespace issue differently, a “register” in hardware is a set of F/Fs. A “register” in ILP machine language is part of the namespace. This namespace element, though inspired by hardware registers, is nonetheless a totally different entity, typically implemented in ILP machines using register(s), shadow registers and a host of logic elements. “Does 1600 Pennsylvania Av. agree with 10 Downing St. or is it aligned with Elysee palace . . . .” is surely not a discussion about buildings' brickwork.

Embodiments described herein may provide the ability to address “infinite Variables”, system data elements that are not in Memory. Applications may include:

(1) High speed communications channels, whether they handle Internet, SONET or other protocols, deploy “infinite” data streams. A processing element, whether it is called a router or a switch, continually processes the incoming stream, adds and/or subtracts portions and forwards the outcome to the output stream. Parts of the information may or may not be placed in memory.

(2) Instrumentation. The extended concept of Variables as described herein may allow inputs from sensors and outputs to actuators to be included without any requirement that the corresponding data be placed in memory (by means that are outside of the current “processor” model). In this context, a requirement is that the mechanism responsible for the Variables (the Mentor) knows how to send settings to actuators and obtain sensors' readings.

The Variable may be an “infinite Variable”, with the processor hardware providing the needed operational buffers and protocol control logic. Some of the Variables used in communications or instrumentation may have partial information placed in main Memory or may have no residence information placed in main Memory.

Providing in the basic model mechanisms such that information need not be staged through main Memory has the potential to significantly reduce the volume of traffic through Memory, and thus may also solve or significantly reduce the effects of the “Memory Wall” limitation.

As an example, assume that two processors are assigned to the task (or two sets of processors if the task is broken into array segments and each segment is assigned two processors). FIG. 2 illustrates memory bandwidth requirement reduction through the use of processor teams. The first processor accesses array A in Memory and produces A1. A1 is defined as an “infinite Variable” and transferred via a processor-to-processor link to the second processor (or set of processors), which buffers the content of only three rows of A1 data in internal processor buffers.

This example demonstrates two aspects of the use of “infinite Variables”. The first aspect is the reduction of demands on main Memory, both in the size of Memory space needed and in main Memory bandwidth (when the processor internal buffers can handle the intermediate results). The second aspect of this example is the general use of “infinite Variables” for forming “production line teams”, where each portion of the task is done by a processor set. Results are passed to the next set in the production line, requiring only the minimum amount of memory/hardware register space to stage buffering. For a full example of the “production line” approach, consider an airplane design system where one set of processors converts the mechanical drawing tables to finite elements data tables, a second set performs the finite elements analysis, and a third set interacts with the designer(s) through graphic interface(s).
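The two-stage arrangement can be sketched as a pair of streaming stages in which the intermediate result A1 is never stored as a whole. This is a minimal sketch under stated assumptions: Python generators stand in for the processor-to-processor link, the first stage's doubling of each element and the second stage's column sum over three rows are placeholder computations chosen only for illustration, and the names `first_stage` and `second_stage` are hypothetical.

```python
from collections import deque


def first_stage(a_rows):
    """First processor: read rows of A and stream rows of A1, one at a time."""
    for row in a_rows:
        # Placeholder transformation (assumption): A1 is A with each element doubled.
        yield [2 * x for x in row]


def second_stage(a1_rows):
    """Second processor: keep only the three most recent rows of A1 in a buffer."""
    window = deque(maxlen=3)  # the oldest row is dropped automatically
    for row in a1_rows:
        window.append(row)
        if len(window) == 3:
            # Placeholder computation (assumption): column sums over the 3-row window.
            yield [sum(column) for column in zip(*window)]


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
results = list(second_stage(first_stage(A)))
```

At no point does more than three rows of A1 exist in the second stage, which mirrors the reduced demand on main Memory space and bandwidth described above.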

Processors as described herein may include a computer core architecture that produces improvement in the following areas:

(1) Performance: An architecture overcoming the performance barriers, such as the “ILP Wall”, characteristic of some existing register-based architecture cores.

(2) Hardware resilience: The architecture, based on a spare section replacement strategy, enables high manufacturing yield, low cost, and field self-repair capabilities.

(3) Software robustness (e.g., a general measure of how long it takes to debug a program or a system, and of the operating time between program or system failures). Embodiments as described herein may include a set of advanced “built-in” hardware elements that provide system semantic coherency, providing debugging support and error exclusion methods for system debug.

(4) Compatibility. In many aspects the architecture is compatible, operationally or conceptually, with powerful software system concepts such as “object oriented systems”. In some embodiments, in order to facilitate and shorten system/program debug time, a device provides built-in software compatibility and migration abilities that ease migration and allow “AS-IS” use of existing machine language applications. Specifically, in “compatibility mode” one will not be able to get the advantages of all the new capabilities mentioned above, but will be able to run machine language programs AS-IS.

Other advantages of embodiments described herein may include:

1. The instruction set architecture is implementable for high end processors as well as for workstation or personal device (cellphone) versions. The high end example presented in the following description may deploy Processor-In-Memory (PIM) large footprint silicon technology. Note that the architecture described in this document has, in some embodiments, a Memory interface of four 512 bit words, a design that would require a package with several thousand pins if implemented as a single processor on a die. The PIM implementation allows this very wide interface, for example for a PIM module with (internally) 8 processors and a large Memory per die. Another implementation version (described under the heading Alternate Embodiment #2 in this description) is targeted to use a smaller footprint and, most importantly, a package with a limited number of I/O pins, whether the design is implemented in PIM type technology (limited number of metal layers) or in the higher number of metal layers typical of silicon processor logic technology.

The implementation produces processors that may outperform their register ILP counterparts due to parallel operation of addressing and data ALUs and due to the removal of the artificial serialization caused by the registers in the namespace of ILP processors. Processor performance is critical, as discussed earlier; going down in performance just does not work in the marketplace. Thus the architecture has the same advantage that catapulted the IBM-360: an instruction set architecture that fits the “processor family” concept, producing instruction set compatible processors that fit the general requirements of supercomputers as well as workstations, PCs, personal devices, graphic devices and instrumentation devices (DSPs, etc.). Thus program developers may develop or modify programs using their workstations with only occasional checks on their institution's high end supercomputer machines.

2. The Variable namespace described herein may allow for automatic bounds checks and other program debugging supports.

3. Using addressable elements as seen by the application program: Variables. The Variable arrangement allows one to associate hardware visible properties with the Variables, properties such as “to be used by one processor or processor-task-team at a time”, and to come up with methods of handling the distributed memory that are much more efficient, less susceptible to malware intrusion and less error prone.

4. By making “Infinite Variables” part of the basic system model, processor-to-processor links may be formed without the need to deposit all information in Memory. Similarly, raw instrumentation data can be processed and filtered before depositing relevant information in the Memory.

In embodiments described herein (outside AS IS emulation mode), there is no “tower of Babel of machine languages among modules”: all the modules can use the same machine language, and the machine language need not contain namespace elements relating to “Registers”, “Vector Registers”, “stacks” or other hardware elements. The machine language in embodiments described herein may include both the DO OP code and the ALL OP code as loop operators. The use of the ALL OP code may enable a compiler to indicate to the module that the computations in the loop are immutable, i.e., input operands are not changed by computations within the loop. The module may therefore process the loop iterations and sub elements of arithmetic statements in any order.

The processor thus may deploy performance enhancements using multi word operations (as will be described in the following sections, where the design uses 8 operand sets) presently seen as Vector operations, multithreading or other parallel means. A user developing a model on a GPP processor may see an order of magnitude performance enhancement by moving the model to a single PIM based processor.

The hardware mechanism in the embodiments described herein may deploy the IMMUTABLE program micro-parallelism and be in charge of all the associated data buffering. Furthermore, the hardware may choose to process large arithmetic expressions by “Vector type grouping” or by processing one expression at a time. Consider the expression:

DO I=1, 10000; A(I)=(B(I)+C(I))*(D(I)+E(I)); END DO;

For example, in Vector grouping a set of 256 (B(I)+C(I)) computations is performed and stored in BUFFER. Next, 256 A(I)<=BUFFER(i)*(D(I)+E(I)) computations are performed, where “I” is the expression index and “i” is the BUFFER index. The hardware takes full responsibility for Vector type grouping and BUFFER size. BUFFER does not show up in the software, and the software need not map the algorithm to Vector registers; the process is similar to cache and virtual memory operation in that it is fully transparent to the program.
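The grouping can be sketched as follows, assuming that the buffered (B(I)+C(I)) partial results of a group are subsequently multiplied by the corresponding (D(I)+E(I)) terms, so that the result equals the expression in the DO statement above. The function name `grouped_eval` and the `group` parameter are hypothetical; in the described embodiments the grouping is a hardware matter that the program never sees, and software is used here only to show the data flow.

```python
def grouped_eval(b, c, d, e, group=256):
    """Evaluate a[I] = (b[I] + c[I]) * (d[I] + e[I]) in groups of `group` elements."""
    n = len(b)
    a = [0] * n
    for start in range(0, n, group):
        stop = min(start + group, n)  # the last group may be shorter
        # Step 1: compute one group of partial sums into the internal buffer.
        buffer = [b[I] + c[I] for I in range(start, stop)]
        # Step 2: combine the buffered sums with the second factor.
        # "I" indexes the expression; "i" indexes the buffer, as in the text.
        for i, I in enumerate(range(start, stop)):
            a[I] = buffer[i] * (d[I] + e[I])
    return a


n = 10
B = list(range(n))
C = [1] * n
D = [2] * n
E = list(range(n))

grouped = grouped_eval(B, C, D, E, group=4)  # small group size to exercise a partial group
direct = [(B[I] + C[I]) * (D[I] + E[I]) for I in range(n)]
```

The grouped and direct evaluations give the same values for any group size, which is why the buffer size can remain an internal hardware decision.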

The compiler is, however, entirely responsible for checking the immutability assertion and will use ALL instead of DO to indicate that the loop operation is indeed immutable. More advanced compilers may use the “IMMUTABLE” assertion in the code to include version control or other techniques to provide for program section immutability (see Simple Relaxation). Similarly, the compiler and other software tools may analyze and provide macro parallelism enabling multithreading of the code.

To provide compatibility among the various modules, all modules may recognize all family data types and perform all programs at least at a sequential pace. This may include some graphic data types or, in the case of a C++ compatible computing device design, all the C++ object oriented structures including objects, class, abstraction, encapsulation, inheritance, polymorphism, etc. The definition may include what graphic operations are done by the nodes in the “processor logic” or PIM silicon environment and what operations are parts of the rendering hardware that is included in the display. As was indicated before, unlike in current processors, I/O Variables are part of the Variable namespace in all implementations, whether in the PIM environment or the processor bus environment and whether the information is deposited in Memory or not. The replacement of the single level control by a dual level control and the replacement of “[Memory]+Registers+PC+PSDW” with the “Variable” namespace thus may provide an architecture that is applicable to a diverse range of processors.

Operand Traffic Strategy and Processor Performance

In some embodiments, a processor implements a multi-layer architecture. FIG. 1 is a diagram illustrating one embodiment of a processor including a three-layer architecture. Processor 100 includes Data Structure layer 102, Mentor layer 104, and instruction interpretation and control layer 106.

In this example, the hardware architecture is built in three layers. The bottom layer, the Data Structure layer, is dedicated to “end operand” operations (floating point operands in floating point operand representation programs), what Backus calls “significant data itself”.

Data Structure Layer

The Data Structure layer contains:

(1) the local high speed storage (which is the equivalent of the registers, cache, vector registers and shadow registers in a conventional design),

(2) the functional units (mostly in groups of 8 identical units+spare),

(3) the multiplexing and gating structures connecting main memory, local storage and functional units, and

(4) any traffic matching buffers needed to align operands when data blocks (of 8 words) are either aligned to a block boundary or converted between single word and block format. (Such buffers are used, for example, in conventional systems for “traffic matching” between main memory and data caches or instruction caches, as data from main memory to cache is transferred in data blocks.)

The arithmetic manipulation and traffic associated with forming operand addresses concern not the significant data itself but rather where to find it; that task is assigned to the next layer, the Mentor layer.

The top layer includes the Instruction interpretation and the program flow control section. It is understood that data traffic thus exists not only within a layer but also among the layers. In the description below, the first topic addressed is the traffic within the bottom (Data Structure) layer; the discussion then moves to control and data traffic among layers.

In some embodiments described herein, the software/hardware interface (the DONA instruction set) is intended to be as devoid as possible of the internal hardware structures, with the aim that the DONA interface should be “encouraging us to think in terms of the larger conceptual units of the task at hand”.

Data Structure Mission Statement (PIM Embodiment)

The mission of the Data Structure is:

(1) To provide an effective main memory interface, local high speed memory and functional unit “end operand” manipulation freeway structure that is 8 words wide.

(2) Use the identical copies plus spares strategy.

(3) Design a Data Structure specifically optimized to directly manipulate Variables, which is the term that corresponds to Backus's term “conceptual units”.

Main Memory, Registers, Cache and Operand Staging Traffic

Observations:

(1) Main memory is typically at least an order of magnitude slower in response than the operand flow cycle time used for keeping the pipelined Functional Units busy.

(2) An operand just computed is much more likely to be required in the next computation than a random operand in memory.

(3) Most computations involve arrays where one can typically anticipate (speculate) which operands are needed before the operands are used, so operand traffic should be staged through a very high speed local storage, preferably implemented using simple single cycle time devices. In a typical ILP register machine design, this local storage comes in several categories; the usual types are machine registers, shadow registers, vector registers, L1 data cache, and instruction cache.

Operand Traffic Considerations

To overcome the speed disparity between the machine cycle time and main memory, a high speed local memory, preferably with one cycle access time, may be used in the architecture. In some embodiments, this high speed internal memory is in the form of a Frames/Bins structure. This structure may be relevant to several different contexts as described below:

(1) The first is that the local high speed memory (acting as a cache) is the basic remedy for (main) memory access delay.

(2) The second is that the local high speed memory is a multiplier of memory bandwidth, as operands already in the cache need not come from memory.

(3) The third item is providing the underpinning elements for handling operands in terms of Variables (the conceptual units) rather than just word operands.

(4) The fourth is that, through the Variables construct, the machine provides operand coherency as part of software robustness. As one aspect, operand coherency assures that full array bounds checking is performed for read and write operations.

(5) Speculative execution support. One need not add shadow registers and the like in order to support the storage for the undo operations that are part of speculative execution, such as branch prediction and out of order FU operations.

The contribution of operand staging in a high speed local memory thus falls into two domains: one is an increase in the speed of operand access; the other is as a multiplier of operand bandwidth (volume of operands) available to the functional units. The speed gain is proportional to the ratio of main memory access cycles to cache access cycles. If memory access takes N cycles, the cache improvement in operand access speed is up to N to 1. The bandwidth multiplier, however, is even more impressive, as it may raise operand bandwidth by a larger factor, a very important consideration for compute intensive programs when both program and data fit inside the local high speed memory. In current ILP register machines, the role of this high speed memory is played by the “cache” and/or by the cache hierarchy (the L1, L2, L3 caches).
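The “up to N to 1” bound can be illustrated with a simple average access time model. The model is an assumption introduced here for illustration and is not taken from the description: a fraction `hit_fraction` of operand accesses is served by the local high speed memory in `local_cycles` cycles and the remainder by main memory in N cycles. The function names are hypothetical.

```python
def effective_access_cycles(n_memory_cycles, hit_fraction, local_cycles=1):
    """Average cycles per operand access under the simple two-level model (assumption)."""
    return hit_fraction * local_cycles + (1.0 - hit_fraction) * n_memory_cycles


def speed_gain(n_memory_cycles, hit_fraction):
    """Ratio of main-memory-only access time to the average access time."""
    return n_memory_cycles / effective_access_cycles(n_memory_cycles, hit_fraction)


best_case = speed_gain(10, 1.0)   # every operand found in the local memory
no_benefit = speed_gain(10, 0.0)  # every operand fetched from main memory
typical = speed_gain(10, 0.9)     # illustrative 90% local fraction (assumption)
```

Under this model the gain reaches N to 1 only when every operand is found in the local memory, and falls toward 1 to 1 as more accesses go to main memory, consistent with the “up to” qualifier.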

Speed gain: The effects of a local, one cycle memory access time are obvious. Performing the original three address Von Neumann type instruction (for example, on the SWAC machine) required four memory accesses: one for reading the instruction, two for reading operands and one for storing results. The introduction of any form of Local High Speed Memory (LHSM), originally the Accumulator (ACC) in the IBM 7094 and Univac 1108, then “registers”, then caches, then shadow registers in ILP machines, boosts performance by increasing access speed through the use of the LHSM.

Bandwidth multiplier: Especially in array or list operations, the presence of the LHSM is mainly needed in order to supply very high operand bandwidth to/from multiple pipeline stages for multiple simultaneous operations or for use of multiple copies of Functional Units. The bandwidth multiplier is due to the fact that recently used operands are already in the LHSM and, when array operands are used, the next set of array operands in the cache can be anticipated or they are recent results of computations. The computational bandwidth, especially when using multiple Functional Units, is directly related to operand bandwidth. An algorithm may include a given number of “end operand” computations; the higher the “end operand” bandwidth, the faster the program completes.

Data Structure Overview

In some embodiments, a processor includes a Frames/Bins structure. The Frames/Bins structure may implement a method that supports the handling of data structures larger than one word, with the “conceptual units” being Variables in this presentation's nomenclature. A Frames/Bins structure may dispense with the need to use specialized shadow circuits and directly support “undo” operations fundamental to speculative execution. The “undo” capability is mandatory for branch prediction and may also be used in debugging tools (as it may, for example, present to the debugging tools the program's history just prior to an exception or to encountering a “debug flag” during program access to a memory or a Bin location marked by the debug flag); those debug tools may prove to be a capability critical to supporting high software robustness. The Frames/Bins structure may enable the expansion of the model and the scope of Variables to include not only accessing words in Memory but also serving as a home for handling “infinite Variables” such as communications channels, I/O, processor-to-processor links and instrumentation.

In some embodiments, a processor implements a crossbar interconnect schema. The schema may provide a general solution for operand flow as well as operand word alignment, such that the data structure properly supports both single word (scalar) and array (vector) operations, as well as offering several available forms of addressing the spares yield issue needed for high reliability, high yield designs.

In some embodiments, tarmac registers are placed and used to aid in control synchronization, operand flow and operand alignment. (By analogy, your arduous flight has just landed and is sitting on the tarmac; you are cooped up on the plane because a departing flight has yet to leave your assigned arrival gate.)

As was previously pointed out, a performance increase of an order of magnitude over present ILP designs requires the use of multiple sets of (FADD, FMPY, etc.) functional units. To meet the performance and spares targets, the functional unit set size is chosen as 9 functional units: 8 operational and one spare.

Considering the data bandwidth needed to deploy 8 FADD, 8 FMPY, etc., the basic data structure elements, which are the Main Memory, the LHSM and the interconnect structure, handle traffic in blocks of eight 64 bit words. From a traffic flow point of view, the data structure handles 512 bit operands.

FIG. 3 is a diagram showing an overview of the data structure in certain embodiments. The arithmetic Functional Units on the right side of FIG. 3 are shown as sets of 9 identical Functional Units.

In this example, the processor implements the LHSM by the Frames/Bins design depicted in FIG. 4. The Frames/Bins structure may address operand access via Variables (conceptual units), operand memory latency, operand flow bandwidth, and software coherency for Variables.

The Frames/Bins Structure

FIG. 4 is a conceptual diagram of the LHSM, the Frames/Bins structure, which is the center element in FIG. 3. The LHSM may include, in one example, 64 one-cycle memory elements of 4K words, 64 bits each; each individual memory may be referred to as a Frame. The Frames/Bins structure forms a multi-ported local high speed memory design where the single ported Frames of 4K×64 bit words are the physical elements and the Bins are the logical elements. The size of a Bin may range from a single 64 bit word to 262,144 (256K) 64 bit words. The content of a Bin larger than one word is multiplexed across multiple Frames. The specific Frame numbers and sizes are provided for illustrative purposes only.

Specifically, the 6 least significant bits (LSB) of a Bin address select the Frame and the most significant bits (MSB) select a word location within the Frame. In FIG. 4, Bin A is an 8 word Bin located in 8 adjacent Frames, from Frame 0 through Frame 7. Bin B is a 128 word Bin spread across all 64 Frames. Bin C is a 64 word Bin also spread across all the Frames, and Bin D is a one word Bin located in Frame 13. FIG. 5 is another presentation of the Frames/Bins implementation showing the addressing arrangement of Bin words inside the Frames structure.

This Frames/Bins arrangement uses single ported memory devices, and as a structure the Frames/Bins provides the capability of a multi ported central LHSM for the design. The Frames/Bins arrangement also provides the “home” of all “conceptual units”, which are the Variables.

The Frames/Bins may replace all registers, vector registers, data cache, instruction cache, and speculative execution shadow registers such as are found in some existing processors.

The full Frames/Bins address size is 18 bits. Specifically, the 6 least significant bits in a Bin address select the Frame. The upper 12 bits of the Bin address select a word in the Frame.
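The address split described above can be illustrated with a minimal Python sketch. It assumes the example configuration of 64 Frames of 4K words each; the names FRAME_BITS, WORD_BITS and decode_bin_address are hypothetical and chosen for illustration only.

```python
# Illustrative constants for the example configuration (64 Frames x 4K words).
FRAME_BITS = 6    # the 6 least significant bits select the Frame
WORD_BITS = 12    # the upper 12 bits select the word within the Frame


def decode_bin_address(addr):
    """Split an 18 bit Frames/Bins address into (frame, word)."""
    if not 0 <= addr < (1 << (FRAME_BITS + WORD_BITS)):
        raise ValueError("address does not fit in 18 bits")
    frame = addr & ((1 << FRAME_BITS) - 1)
    word = addr >> FRAME_BITS
    return frame, word


# Consecutive addresses fall in consecutive Frames, so an 8 word Bin
# starting at address 0 occupies Frames 0 through 7.
frames_of_small_bin = [decode_bin_address(a)[0] for a in range(8)]
```

Because the Frame index comes from the low-order bits, consecutive words of a Bin are interleaved across Frames, which is what allows a block of 8 consecutive words to be read from 8 different single ported devices in one cycle.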

The advantages of this approach from a data flow model perspective may include:

-   1. The basic Frame element is a standard single ported memory device, whereas the local high speed memory (LHSM) task demands a multi ported apparatus for the many parallel tasks of local high speed storage in each cycle. The approach provides for design simplicity, design flexibility and applicability of the spares strategy.

Regarding spares, the implementation has one or more Frames as spare elements. As an example of design flexibility, one may use 128 dynamic memory devices with one cycle access and two cycles per operation, achieving an overall performance that is not far below the approach using one cycle devices.

-   2. The Frames/Bins structure may be operated as multiple simultaneous ports for single (64 bit) word ports and/or parallel 8 consecutive word block ports (512 bit ports).

-   3. Operand traffic is self-aligning and may avoid the traffic blockage problems that occur when two or more ports request access to the same physical memory device. In this Frames/Bins multi access structure, once the first logical port is given a single cycle access, the next port's transfer is from a different set of Frames, allowing the second requestor access within one cycle. The computing device may be equipped with a small amount of alignment data buffering (e.g., by way of tarmac registers as further described herein or other means) as part of each functional unit; the data buffering is used to self-align traffic in one or more cycles in order to deliver maximum operand bandwidth.

-   4. The data structure may be, in one example, a 512 bit wide data structure that operates using “channels” of 8 word blocks. The Frames/Bins structure may be capable of simultaneously servicing up to 8 accesses of 8 word blocks in each machine cycle. In this case, the maximum bandwidth of the Frames/Bins structure would be 4096 bits per cycle. The Frames/Bins structure is managed through an addressing structure that is capable of accepting requests from up to 10 “channels”. The addressing structure resolves contentions when more than one channel requests the same Frame and decides which 8 out of 10 channels are given access each cycle. Memory to/from Frames/Bins traffic is given priority.

-   5. The logical channel allocation may be, in one example, as follows: four 8 word wide channels handle Bins to FUs operand traffic. Two 8 word wide channels handle FUs to Bins result traffic. Four 8 word channels handle Bins to/from Main Memory transactions.
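The channel arbitration in item 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the text says that memory traffic has priority and that at most 8 of 10 channels are served per cycle, but it does not specify the tie-breaking rule, so the sketch assumes that contention is resolved per group of 8 Frames and that the lower channel number wins among equal-priority requests. The function name arbitrate and the request tuple layout are hypothetical.

```python
def arbitrate(requests, max_grants=8):
    """Pick the channels that get Frames access in one machine cycle.

    requests: list of (channel, is_memory, frame_group) tuples, where
    frame_group (0..7) names the block of 8 Frames the channel wants.
    Returns the list of granted channel numbers.
    """
    # Assumption: memory traffic first, then lower channel number first.
    ordered = sorted(requests, key=lambda r: (not r[1], r[0]))
    busy_groups = set()
    granted = []
    for channel, _is_memory, group in ordered:
        if len(granted) == max_grants:
            break
        if group in busy_groups:
            continue  # another channel already owns this Frame group
        busy_groups.add(group)
        granted.append(channel)
    return granted


# Channel 6 (memory) and channel 0 (FU operand) both want group 2:
# the memory channel wins and channel 0 must wait for a later cycle.
example = arbitrate([(0, False, 2), (6, True, 2), (1, False, 3)])
```

With 8 grants of 8 words of 64 bits each, the peak figure quoted in item 4 follows directly: 8 × 8 × 64 = 4096 bits per cycle.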

The Bins as Software's Conceptual Units

Data in the Frames is arranged by (logical) Bins. Each Bin represents one or more Variables, the program's logical entities, be they a single variable, an array, a program, a byte-string file, etc. Each Bin thus represents one or more software conceptual units (as used in Backus' terminology). The Bins may be the homes of individual program Variables.

The operand may be addressed by its Bin address and is assumed to reside in the Bin (the Variable's cache). If it is not in the Bin, the Mentor mechanism associated with the Bin will fetch the operand from Memory and will deliver it (after the memory access delay) from the Bin.

Each program Variable, including the program file itself, has a Bin, which is a local high speed memory that contains all or part of the Variable's data. Per this model, the full data sets of all the Variables reside in Memory (unless they are Infinite Variables), but this fact is invisible to the program, as all accesses to the Variable are done through its assigned Bin. As far as the program is concerned, the Variables, be they a single operand, array, list, program, communication port or instrumentation port, while they “exist in main memory storage or the communications or instrumentation ports”, are accessed through the Bin allocated to them; they “live in the Bin”.

The Bin is a logical structure that is distributed across a set of 64 physical Frames. It may be implemented by a different number and/or size of Frames, or by means other than the Frames/Bins structure described herein, without losing its software coherency as the logical means of accessing Variables.

Operational Aspects of Bins/Frames

In accessing the Frames/Bins, all requests that have the same 6 LSBs in the Bin's Cache address during the same control cycle will request the same Frame, sometimes causing collisions of the requests. A priority resolution circuit provides access and allows an immediate Bin transaction for this physical cycle for one of the channels in the conflict. If not all requests in a control cycle are honored due to request conflict(s), the data for the honored request is staged in Functional Unit (or other location) buffers called the tarmac registers.

In a conflict situation, one of the requestors is granted Frames access; in the following physical machine cycle both the first and the next requestor are given access to the Frames. The first requestor, which was already granted one access, is now looking ahead and receiving/depositing its next operand block by accessing operands from the next set of Frames. This self-aligning process continues until all the requestors in the current control cycle are granted Frame access. Note that in the last physical cycle all operands with Frame access conflicts receive access, some for the current and some for future operands (operand staging). The control cycle is now allowed to conclude by following through the Functional Units' pipelined part of the conflicted control cycle. The Functional Units move their pipeline according to the cadence of control cycles. Note that this pre staging process, which may take several cycles, typically occurs once at the loop's onset. Once pre staging is done, the following control cycles will typically require only a single cycle each, as all operands are already aligned in the tarmac buffers to simultaneously deliver operands to/from physically different Frame segments.

Definition: A control cycle takes one or more machine cycles and ends when all requests from the Frame/Bin structures have been granted.

Therefore, a control cycle may initially take several machine cycles in order to align requests by buffering the operands in tarmac registers ahead of/after use. During loop operations the buffering arrangement using tarmac registers is retained for following loop cycles such that, upon executing the same control cycle the next time, the access to the Frames is already properly aligned and staged ahead of time, and in all consecutive loop cycles the control cycle is typically executed in a single machine cycle. The consequence of “the traffic is self-aligning” is that the staging penalty is paid once at the onset of a loop and then the system stays synchronized such that the majority of loop operations are performed at maximum throughput, where one control cycle duration is typically a single machine cycle.
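The self-aligning behavior can be modeled with a small Python sketch. It is a simplified model under assumptions not stated in the text: each requestor streams consecutive blocks of 8 Frames (so its next access is to the next Frame group, modulo 8), waiting requestors are admitted in list order, and the function name first_control_cycle is hypothetical.

```python
def first_control_cycle(start_groups, num_groups=8):
    """Simulate the machine cycles of the first control cycle of a loop.

    start_groups[k] is the Frame group requestor k needs first.
    Returns one dict per machine cycle mapping frame_group -> requestor.
    """
    if len(start_groups) > num_groups:
        raise ValueError("more requestors than Frame groups")
    admitted = {}                      # requestor -> blocks moved so far
    pending = list(range(len(start_groups)))
    schedule = []
    while pending:
        busy = {}
        # Requestors admitted earlier look ahead to their next group.
        for r, done in admitted.items():
            busy[(start_groups[r] + done) % num_groups] = r
        waiting = []
        for r in pending:
            g = start_groups[r] % num_groups
            if g in busy:
                waiting.append(r)      # conflict: retry next machine cycle
            else:
                busy[g] = r
                admitted[r] = 0
        pending = waiting
        for r in busy.values():
            admitted[r] += 1           # one block transferred this cycle
        schedule.append(dict(sorted(busy.items())))
    return schedule


# Three requestors colliding on Frame group 2 need three machine cycles;
# afterwards they sit on groups 4, 3 and 2 and no longer conflict.
cycles = first_control_cycle([2, 2, 2])
```

In this model the blocks fetched early by the first requestors are the ones that would be held in the tarmac registers, and every later control cycle of the loop completes in one machine cycle because the requestors are now offset from one another.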

The definition of “control cycle” is thus distinct from the (physical) machine cycle; this is due to the fact that the operand activities in the Functional Units and in the controlling “Dynamic VLIW” are dictated by “control cycles”. (“Dynamic VLIW” is explained later as part of defining the Exa-64 control structure.) Control cycles are issued based on traffic to/from the logical elements, the Bins. When alignment is necessary due to Frame conflicts or operand position in an 8 word block, the first control cycle will typically take several physical cycles to conclude. The conclusion of a control cycle signifies that all operands are now available at the Functional Unit inputs, so the Functional Unit pipeline may commence the arithmetic or logic operation. Similarly, upon the conclusion of the control cycle the Dynamic VLIW may issue the next VLIW coded instruction.

The mechanism of separating control cycles from machine cycles may take one or more extra cycles at loop onset. After a few extra cycles at the onset of a loop, the system delivers close to the maximum operand flow made possible by the number of Frames in the LHSM.

The following discussion addresses the circuits in FIG. 3 surrounding the Frames/Bins structure. FIG. 6 is a diagram illustrating the Data Structure interconnect, emphasizing the level of interconnect among the various elements in terms of “Channels”.

The term “Channels” mainly represents a logical way to show interconnects. The implementation of the powerful and complex interconnects is explained after Main Memory and Functional Units are presented.

Main Memory

Main Memory in this example is a 4 ported Memory system. Memory may deploy L2 and L3 caches as well as a Virtual Memory structure and will typically connect to multiple processing cores.

Functional Units Tarmac Circuits Alignment and Staging

The Functional Units (FUs) section deploys multiple copies of the same functional units. The use of multiple copies of functional units provides high performance and enables high chip yield through spares replacement. The Functional Units are the standard pipelined functional units of ILP designs, consisting of floating point, fixed point, logic and byte manipulation units, plus a new set of units that will be addressed later in the presentation. For this section, however, assume the standard kinds of arithmetic and logic FUs.

As previously stated, all the FUs are equipped with a set of “tarmac registers” (see FIG. 7) to align operand traffic, to provide for operand staging in cases of conflicts among the channels vying for access in the same cycle, and to “block align” traffic according to the block alignment of the outputs when operating on 8 word data blocks. In addition to the operand staging that is needed in case of Frame use conflicts, the tarmac registers are used for operand alignment. Specifically, when operands are used in blocks of 8 words, the input 8 word blocks may not be properly aligned to the word positions in the result blocks. The combination of tarmac registers and alignment circuits positions the input operands at the proper 8 word block position for the results.

Regarding the implementation shown in FIG. 7, the input circuit per Functional Unit leg enables selection from one of four “channel” buses or one of six bypasses and contains four tarmac staging registers. The exact number of tarmac registers per input or output may be determined by implementation. The Functional Unit output circuit has four tarmac staging registers and deposits results on one of two output buses. Bypasses are formed among functional units of the same ilk (Fixed, Floating) that occupy the same Word (n) position in an 8 unit bank. In this design example the Floating Point Functional Units have six bypass inputs coming from three ADD banks, two MPY banks and one DIV bank, 8 units per bank. Other designs may have other numbers of banks per unit type.

The block alignment topic is easier to explain when considering single word operations. Consider an operation containing three single word operands A, B and C for an A<=B+C operation. There is a very high likelihood that the 3 least significant bits of the Bin addresses for A, B and C will not all have the same 3 bit value, so B and C will typically arrive in different “highway lanes” (block positions) in the incoming 8 lane channels. Also, the results will need to be sent to a “highway lane” matching A's position.

In the data structure described herein, the alignment logic may switch the B and C inputs to the “highway lane” corresponding to A at the ADD FU input such that the ADD result will show up in “A's highway lane” for storage into the A Bin. The Bin storage operation is controlled using the Bin address and an 8 bit “write mask” used for selecting the proper single Frame in the proper 8 Frame block.
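The scalar alignment for A<=B+C can be sketched in Python as follows. This is a minimal model, assuming that the lane of an operand is given by the 3 least significant bits of its Bin address and that bit k of the write mask enables lane k; the bit ordering of the mask and the names lane and scalar_add are assumptions made for illustration.

```python
def lane(bin_address):
    """Highway lane (word position in an 8 word block) of an operand."""
    return bin_address & 0b111


def scalar_add(block_b, addr_b, block_c, addr_c, addr_a):
    """Model A <= B + C for single word operands arriving in 8 word blocks.

    Returns (result_block, write_mask). Only the lane matching A carries
    a result; the write mask enables that single lane for storage.
    """
    target = lane(addr_a)
    # Alignment logic: move B and C from their own lanes to A's lane.
    operand_b = block_b[lane(addr_b)]
    operand_c = block_c[lane(addr_c)]
    result_block = [None] * 8
    result_block[target] = operand_b + operand_c   # ADD FU in A's lane
    write_mask = 1 << target                       # assumed bit order
    return result_block, write_mask


# B sits in lane 5, C in lane 2 and A in lane 6 of their blocks.
b_block = list(range(10, 18))
c_block = list(range(100, 108))
result, mask = scalar_add(b_block, 5, c_block, 2, 6)
```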

Operand alignment and access conflict resolution mechanisms may be implemented at any part of the data structure system: next to the Frames/Bins, in the data exchange channels or at the FUs. In this example, traffic matching logic uses the tarmac registers at the FUs. Staging and alignment also take place in FU bypass circuits, where the result of one FU operation is immediately used as an input to the same or a different FU and the intermediate result may not be stored, so the bypass function is most efficient when it takes place by the FUs. Thus, in this example the staging/alignment logic is mainly located next to the FUs and uses the tarmac registers. Nevertheless, in some embodiments, alignment and conflict resolution are placed in different parts of the data structure or use other suitable alignment circuits.

Thus, in various embodiments, tarmac circuit obligations may include:

-   -   The task of supporting staging to resolve Frame use conflicts and the task of operand alignment.

    -   Scalar alignment: in scalar computations, where only one word operand per input is involved, the input operands are aligned to the one of eight FUs at the block location of the result (FU output).

    -   Bypass connection for immediate results of a pipelined FU to the same or a different FU.

    -   Staging for repeated loop use of the same operand(s); for example, in A(I)<=B(I)*3.75, the 3.75 operand is staged in the tarmac registers on one input of all MPY FUs for the loop duration.

    -   The same applies to the staging of operand C(J) in the loop computation B(I,J)<=A(I,J)+C(J): a copy of C(J) is placed for the inner loop duration on one side of all the FP ADD units used. Before the next inner loop commences, C(J+1) is placed in the tarmac registers on (one side of) all the FP ADD units.

The Interconnect Sections

Returning to FIG. 6, note that there are four sets of “channels” of Frames/Bins interconnect circuits in the figure: two sets connect to Memory, one set goes to the Functional Units and one set comes from the Functional Units.

The first “channel” circuit set, on the bottom right side of FIG. 6, provides a maximum of four 512 bit (eight 64 bit words) operands going from the Bins to the FU inputs. This interconnect requires a crossbar (or multiplexor) apparatus where each one of the four 8 word FU inputs receives (reads) one out of 8 groups of 8 words from the 64 word LHSM Frames/Bins structure.

The next “channel” circuit, on the top right side of FIG. 6, is the “results” crossbar, where up to two 8 word FU result outputs are gated to the write inputs of two out of the 8 segments of the appropriate Bin in the allotted Frame positions.

The next circuit, left side bottom, is the Main Memory “channels” of data to Memory from the LHSM structure. Finally, the circuit on the left side top represents the Main Memory “channels” of data from Memory to the LHSM structure. Depending on the implementation and the type of Main Memory technology, the two sets of memory “channels” may use two physical crossbars (or multiplexors), one input and one output, or a “line sharing” type of circuit(s). Each one of the 8 Frames/Bins 8 word blocks may send data to or receive data from one of the four 8 word wide Main Memory ports. Thus a maximum of four eight word transactions may take place in each cycle.

In this case, all four interconnect circuits, being 8 words wide, may use most types of the spare replacement techniques used in the industry, whether the techniques are used as a part of the manufacturing process or in the field for system self-repair. Note that in large footprint devices a spare replacement strategy in the majority of device parts may be the basic prerequisite for manufacturing the device (e.g., the yield will be too low without spares replacement), and also the key to its low cost.

Data Structure and Crossbar Interconnects

The following section describes the Interconnect structure implementation based on the use of crossbar technology in various embodiments.

FIG. 8 is a diagram illustrating a crossbar. In the images on the right, a double line represents a full crossbar, and a single line represents a selector.

In FIG. 8, the notation used to represent the crossbar to the right (including spare) includes an indication of the direction of information flow. The crossbar is in standard notation including drivers and spare section. The lower right diagram illustrates a comparable notation for an N to 1 selector.

Bins/Frames Interconnect to and from Main Memory

FIG. 9 describes the data interconnect between main memory and the Frames. The left side of FIG. 9 is the “read” portion bringing information from memory to the Frames. The right side of FIG. 9 is the “write” part of the circuit used for moving data from the Frames and writing it into memory. The system uses 4 memory buses and the read side and write side may or may not share the physical memory data buses. The “read data” and “write data” lines are not connected in FIG. 9 since their physical connection is left as an implementation choice that may be dictated by the main memory system.

Blocks of 8 words in main memory are deposited as blocks of 8 words in the Bin. If X - - - XXX011 is the array base address in main memory, reading or writing will commence at location 3 in the Bin and the first three Bin locations are left unchanged (write) or blank (read). If 0 base alignment for tables (or for all Variables) is desired, the tables (or all Variables) may be placed at 8 word boundaries in main memory.

The memory read and write addresses contain an 8 bit mask in lieu of the 3 least significant bits, so the masked words are not involved in the 8 word memory block transfer transaction. This mask selects the particular single word or the section of the 8 word block to be read or written in main memory.
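As an illustration of the mask, the following Python sketch builds the 8 bit mask for a transfer that starts partway into an 8 word block. The encoding is an assumption: the text does not state the bit order or polarity, so the sketch takes bit k set to mean that word k participates in the transfer. The name block_mask is hypothetical.

```python
def block_mask(first_word, count=None):
    """Return an 8 bit mask enabling `count` words starting at `first_word`.

    Assumed encoding: bit k set means word k of the 8 word block takes
    part in the transfer. By default the rest of the block is enabled.
    """
    if not 0 <= first_word < 8:
        raise ValueError("first_word must be in 0..7")
    if count is None:
        count = 8 - first_word
    if not 0 < count <= 8 - first_word:
        raise ValueError("transfer must stay inside the block")
    return ((1 << count) - 1) << first_word


# Array base address ending in 011: the first block transfers words 3-7
# and leaves words 0-2 untouched.
first_block = block_mask(0b011)
# A single word transaction on word 3 only.
single_word = block_mask(0b011, 1)
```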

Starting from the left of FIG. 9, the Bus 0 word aligner allows one to rearrange the order of the 8 words on Bus 0. The remaining 3 buses connect directly to the second level crossbar, enabling one to deposit a block of 8 words from any of the buses into any of the 8 groups of 8 Frames.

In FIG. 9, only Memory Bus 0 has word alignment features, used mainly to address single word transactions. Buses 1, 2 and 3 are only used by transactions that have the same 8 word alignment in Main Memory and in the Frames. Each data bus may carry eight 64 bit words (512 bits) per cycle.

Single Word Variables

Dimensioned Variables are allocated Bin sizes that accommodate a proper Variable cache size for high performance. For small arrays the Bin allocation may be the size of the entire array, while for large arrays the Bin size allocation will be only a portion of the array.

Single word Variables are, however, allocated 16 words arranged as a circular file in their corresponding Bin. This is done to support speculative execution (branch prediction, out of order execution, etc.) and debugging features without the need to add shadow registers to retain past program history in case of an “undo” due to a misprediction.

When dimensioned Variables are changed in an inner loop, typically a different array entry is changed by the next iteration. If A(6) is changed in iteration 6, A(7) will typically be changed in iteration 7, etc. As long as a “write-through” to memory is not performed on loop entries during speculative computation (there may be more than one changed entry, A(I) and A(I−1) in the example below), one may “undo” the speculative operation by holding back the Memory write-through, as the write-through overwrites the previous values in Memory and cannot be undone. Specifically, with arrays the Bins may hold the new values while Main Memory may hold previous history to support the speculative execution undo process.

Therefore, for dimensioned Variables, one may use the combination of space in the allocated Bins, tarmac registers, and holding back the Memory write-through in order to support speculative operations, where one must retain in cache or recover from Memory the previous values for “undo” operations in case of a speculative misprediction. The case where the same entry in a dimensioned Variable is repeatedly changed by successive iterations is rare, and thus the condition may be handled by blocking speculative execution altogether in those rare cases. In other schemas one may also use the tarmac registers as “undo” buffers.

Note the case of single Variables T and F in a waterfall sort below:

N=M

DO J=1, M; DO I=2, N;

IF (A (I)>A (I−1))

-   -   T=A(I); A (I)=A (I−1); A (I−1)=T; F=1;

END DO

IF (F=0) EXIT DO;

F=0; N=N−1; END DO;

On every iteration, the values of both F and T may be changed. However, the dimensioned Variable operands A(I) and A(I−1) will not be changed after two iterations.

As single Variables are commonly used in loops, turning off speculative operation for changed single Variables may significantly hurt performance. Since it is desired not to have any Variable storage outside the Bins and the Memory, 16 words arranged in a circular file have been allocated for each single Variable in the program. A new single Variable value is placed in the next position in the circular file; it does not replace the previous value. As stated, this is done to avert the need for “shadow registers”. Because the single Variable word may be in any one of 16 positions in the Frames, the word must be aligned to the proper position in an “8 highway lanes” port. Thus one reason for the word alignment circuit for Bus 0 is that this bus is used for single word operand read or write operations that require placing the single word in a specific word of a block, a position that is different from its location in the Bin block.
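The circular file for a single Variable can be modeled with the short Python sketch below. The class and method names are hypothetical, and the bookkeeping of how many past values remain valid is an assumption made so that the example is self-contained; the text only states that 16 words are allocated and that a new value goes to the next position.

```python
class SingleVariableBin:
    """Model of a 16 word circular file holding one single Variable."""

    DEPTH = 16

    def __init__(self, initial=0):
        self.slots = [None] * self.DEPTH
        self.head = 0            # position of the current value
        self.valid = 1           # number of retained values
        self.slots[0] = initial

    def write(self, value):
        # A new value goes to the next position; the old one is kept.
        self.head = (self.head + 1) % self.DEPTH
        self.slots[self.head] = value
        self.valid = min(self.valid + 1, self.DEPTH)

    def read(self):
        return self.slots[self.head]

    def undo(self, steps=1):
        # Roll back speculative writes by stepping the head backwards.
        if steps >= self.valid:
            raise ValueError("history not deep enough for this undo")
        self.head = (self.head - steps) % self.DEPTH
        self.valid -= steps


# Two speculative writes to F, then a misprediction undoes the last one.
f = SingleVariableBin(initial=0)
f.write(1)
f.write(2)
f.undo()
```

Since the current value may sit at any of the 16 positions, its lane within an 8 word block changes from write to write, which is the alignment need the paragraph above attributes to Bus 0.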

This arrangement for single Variables provides a clean Frames/Bins means of supporting speculative execution (such as branch prediction) without adding extra hardware circuits like shadow registers for holding recent history. The method also, in part, creates the need to align a single word to its appropriate word location within an 8 word block in the incoming or outgoing 8 word memory port. The alignment is done by the alignment circuits on the top left and top right sides of FIG. 9.

When dealing with dimensioned Variables, the support for speculative execution is provided by delaying the Bin's cache-to-memory write-through until all pending speculation conditions have been confirmed. The Instruction Interpretation and Mentor circuits check for operand change under speculative conditions, and consequently the Instruction Interpreter properly handles the speculative steps when an inner loop changes Variable entries.

Returning to the alignment circuits on the top left and right sides of FIG. 9, block word alignment for memory channel “0” is provided by two 8×8, 64 bit crossbar interconnects. At the top left of FIG. 9 is the crossbar for Memory to Frames data (read); at the top right is the crossbar for Frames to Memory data (write).

The central portion of FIG. 9 contains the 8×4, 512 bit block interconnect connecting the 4 memory ports to the eight 512 bit blocks of eight Frames each in the Frames/Bins structure. Note that while Bus “0” goes through the alignment circuits, “Bus 1”, “Bus 2” and “Bus 3” connect directly to the memory ports. Write-mask controls are used to control which words within an 8 word block are written into memory or written into the Bin(s).

Bins/Frames Interconnect to the Functional Units

FIG. 10 is a block diagram of the crossbar interconnects from the Bins/Frames to the Functional Units. As is the case with Channel “0” of the memory interconnects, the crossbar structure contains 2 levels. The first level is of the same type as the right center side of FIG. 9, as it is an 8×4, 512 bit wide crossbar interconnect that selects 4 out of 8 Frame blocks to be transferred to the Functional Units. The second level contains 4 sets of 8×8, 64 bit wide interconnect structures that align words within the block. To recap, the first level selects the blocks; the second level arranges the words within the block and presents the information on the four 512 bit Functional Unit input buses.

In FIG. 10 the lines from the Word Alignment circuits to the Functional Unit inputs above show only a very small fraction of the total number of interconnects. The Functional Units, as stated, are clustered into groups of 8+1 Functional Units per type. Thus a group of 8 ADD units receives two sets of properly aligned 512 bit inputs on two of the four input buses. FU 0 of each FU group receives inputs only from Word 0 of the four alignment Bus outputs. FU 1 of each group receives its four inputs from Word 1 of all four buses, and so on; finally, FU 7 receives its four inputs from Word 7 of the alignment Buses.
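The two crossbar levels and the per-word fan-out to the Functional Units can be sketched in Python. This is a functional model only, with hypothetical names (bins_to_fu_buses, fu_inputs) and with the word alignment of each bus expressed as an arbitrary permutation, which is an assumption about what the 8×8 second level can do.

```python
def bins_to_fu_buses(frame_blocks, block_select, word_perms):
    """Model the two level Bins/Frames to FU interconnect.

    frame_blocks: 8 blocks of 8 words (one block per group of 8 Frames).
    block_select: for each of the 4 buses, which block the first level
                  (8x4 crossbar) routes to it.
    word_perms:   for each bus, the source word index for each output
                  word position (second level 8x8 word alignment).
    """
    buses = []
    for selected, perm in zip(block_select, word_perms):
        block = frame_blocks[selected]              # level 1: pick block
        buses.append([block[src] for src in perm])  # level 2: align words
    return buses


def fu_inputs(buses):
    """FU k of a group sees only word k of each of the four buses."""
    return [tuple(bus[k] for bus in buses) for k in range(8)]


# Words are numbered 8 * group + position so routing is easy to follow.
blocks = [[8 * g + w for w in range(8)] for g in range(8)]
identity = list(range(8))
left_three = [(w + 3) % 8 for w in range(8)]
buses = bins_to_fu_buses(blocks, [0, 3, 5, 7],
                         [identity, left_three, identity, identity])
per_fu = fu_inputs(buses)
```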

Functional Units Interconnect to the Bins/Frames

The Functional-Units-to-Bins section contains the 512 bit wide 2×8 crossbar interconnect leading from the Functional Units to the Frames/Bins structure. See FIG. 11. As previously stated, all input operands have already been aligned to the word alignment of the results (output), so no alignment is needed in this section of the data structure. The depositing of a single word or less than a full block is controlled by the use of 8 “write control” tabs per channel.

Returning to FIG. 3, the top left corner shows a block whose function is to supply addresses to the four ported memory system. The addresses are generated by the Mentor circuits, which are the subject of the next parts of this presentation. Mentor circuits first bid for and then get access to a memory port. Up to four Mentors may gain simultaneous access to the memory system. The Mentor circuits deposit the memory address for their block or word transfer on one of their assigned four addressing buses. The 32 bit addresses are sent to the allocated memory port using a 32 bit wide 4×4 addressing crossbar shown in FIG. 12.

This small crossbar simplifies the pipelining of the operations, as the allocation circuit simultaneously notifies the Mentor circuit and pre-sets the addressing path such that there are no delays in memory access on the following cycle, a subject of high importance in strictly sequential programs. The memory addressing circuit in FIG. 12 shows that up to 4 memory channels may be addressed in each cycle. For “read” operations a block of 8 words will be brought up from memory, with 0 fill if the memory address is outside the range of some of the words in the incoming 8 word block.

Block Word Alignment and Pre Staging

As mentioned earlier, the current vector method of aligning blocks of operands through vector registers unnecessarily complicates the software.

$a \cdot b = \sum_{i=1}^{n} a_{i} b_{i} = a_{1} b_{1} + a_{2} b_{2} + \cdots + a_{n} b_{n}$

While some algorithms, like the dot product expression above, fit well in the category of algorithms whose data flow is done through simple array operand indexing, other algorithms use indexing variations of the same array. For example, in a Simple Relaxation Algorithm, a new operand value may be the average of the values of its four neighbors.

DO j=2, M−1; DO i=2, K−1;

A1 (i,j)=0.25*(A (i,j−1)+A (i−1,j)+A (i+1,j)+A (i,j+1));

END DO; END DO;

While the two outer elements A (i,j−1) and A (i,j+1) do align with the innermost loop index (i), the elements A (i−1,j) and A (i+1,j) are, respectively, one element behind and one element ahead of the loop counter.

In either case, in order to gate the right element of the operand stream into the Functional Unit port(s), one has to align the operands into the Functional Unit ports in the algorithmically correct form. It thus behooves one, if possible, to use a single operand stream for both A (i−1, j) and A (i+1, j), as they are only two operands apart.
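For reference, the relaxation loop above can be written as the following Python sketch. It uses 0-based nested lists indexed as A[i][j] in place of the 1-based arrays of the listing, and the function name relax is hypothetical; it illustrates the access pattern only and makes no claim about how the hardware schedules it.

```python
def relax(A):
    """One relaxation sweep: each interior point becomes the average of
    its four neighbors. Boundary entries are copied unchanged."""
    rows, cols = len(A), len(A[0])
    A1 = [row[:] for row in A]
    for j in range(1, cols - 1):          # outer loop over j
        for i in range(1, rows - 1):      # innermost loop over i
            A1[i][j] = 0.25 * (A[i][j - 1] + A[i - 1][j]
                               + A[i + 1][j] + A[i][j + 1])
    return A1


grid = [[0, 1, 0],
        [2, 9, 3],
        [0, 4, 0]]
smoothed = relax(grid)
```

In the inner loop, A[i - 1][j] and A[i + 1][j] walk the same column of operands two positions apart, which is why a single operand stream with staging can serve both inputs.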

Operand Pre Staging

Consider an array whose base entry does not line up at an 8 word address boundary, destined for one of the inputs of a block of ADD Functional Units, while the other input is properly aligned. Assume that the last three bits of the array base address are 011. The first input block transaction will bring five out of 8 words from the array's assigned Bin. This is not sufficient to perform the 8 ADD operations, so the inputs are pre-staged in their respective tarmac registers. Furthermore, the pipeline of the Functional Units as a whole is not advanced. The control cycle discipline does not allow the pipeline to proceed unless all operands are present at the FU inputs.

In the case that the array base address ends in 011, five words out of the 8 incoming in the first “highway transfer” will be realigned in the alignment section by a “left three” (circular) shift. The results of the first channel transaction are deposited in the tarmac registers of the 8 ADD units. “Left three” means that the five words in positions 3-7 are placed in positions 0-4 and words 0-2 are placed in positions 5-7. During the next cycle the same “left three” transformation is made on the next set of 8 words. Now the words in positions 5-7 and the five in the tarmac registers (as well as the aligned operands on the second FU input) go into the 8 ADD units as operands, and the control cycle is now complete (thus issued). The remaining five words are stored in the tarmac registers as operands for the next cycle. The pre-staging information is retained in the tarmac registers for the loop's duration. Thus the first control cycle took two clock cycles; the system is now “self-synchronized” and the following control cycles take a single clock cycle. The alignment setup penalty is therefore but a single cycle at the onset of the loop.
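The “left three” pre-staging sequence can be traced with the Python sketch below. It is a behavioral model under assumptions: the tarmac registers are represented as one list holding the previously rotated block, and the names rotate_left and prestage are hypothetical.

```python
def rotate_left(block, k):
    """Circular shift: the word in position k moves to position 0."""
    return block[k:] + block[:k]


def prestage(blocks, offset):
    """Yield aligned 8 word operand blocks for an array whose base sits
    at word `offset` of its first 8 word block.

    The first incoming block only fills the tarmac registers; each later
    block completes one aligned operand block per cycle.
    """
    keep = 8 - offset      # words per rotated block that land in lanes 0..keep-1
    tarmac = None
    for block in blocks:
        rotated = rotate_left(block, offset)
        if tarmac is not None:
            # Lanes 0..keep-1 come from the tarmac registers, the
            # remaining lanes from the block arriving this cycle.
            yield tarmac[:keep] + rotated[keep:]
        tarmac = rotated


# Array elements 0..20 stored with a base address ending in 011.
words = [None] * 3 + list(range(21))
incoming = [words[i:i + 8] for i in range(0, len(words), 8)]
aligned = list(prestage(incoming, 3))
```

Three incoming blocks produce two aligned operand blocks: the first cycle is the one-time setup penalty, and the final five elements remain staged in the tarmac registers awaiting the next transfer.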

Alignment is one use of tarmac registers. Other uses are holding constants or slow-moving variables, such as storing C (j) in:

A (i,j)<=B (i,j)*C(j);

Or 3.5 in:

B(i)<=T (i)*3.5;

In the case of a constant or a slow-moving operand, the operand or constant is first pre-staged and “spread” by the alignment logic from one location to the tarmac registers of all 8 Functional Units in the block, and is then used for the duration of the loop in one of the input legs of the Functional Units group.

Spares Strategies

In FIGS. 4 through 12, design elements are shown in an 8+1 arrangement where typically the 9th element is used as a spare. This is one strategy, and it is the strategy for Functional Unit blocks in this design example. The Frames and the crossbars may be taken as a complete unit formed from “bit sliced” units of 64 identical structures, one slice per bit, with one or more “bit slices” as spares. Alternative “spares” strategies may be used that best fit, for example, the particular silicon process and the particular choice of the processor's cycle time.

As previously mentioned, technology now enables the use of a spares strategy as a manufacturing tool, where good parts are selected and “zapped” into operational status, or the implementation may use gating, “flash technology”, or other interconnect methods allowing field self-repair.

Mentor Control Circuit Level

Processor Control Circuits Overview

A task of the control structure is supervising the Bins in the Frames structure. Each Bin serves as the home base to one or more Variables. A Variable is a set of one or more operands, and is specified by a descriptor (which may be explicit, inferred, or a combination, as specified by the DONA instruction set architecture to be discussed later). A Variable may represent an array, list, file, queue, single variable, constant, program, routine, instrumentation port, communication port, etc.

The Bins may serve as the means of access to all data Variables. A laptop system, for example, may be based on two virtual levels, Bins to main memory as the first level and main memory to hard drive as the second level. (In more complex systems the “permanent storage” may not be a hard drive, but a more complex structure that may include the cloud.)

Operating according to the two-level architecture principles, each Variable is defined by the descriptor, which contains physical location, size, dimensions, data type, and other properties when applicable to a particular descriptor model, such as Object class, Object abstraction, Object inheritance, etc. In order to present addresses to memory, the program includes the definition of the Variable's memory location and size, or uses shared Variables that have been previously defined, and then accesses operand(s) within the defined Variable.

FIG. 13 is a redrawn block diagram that contains the information in FIG. 3 and includes several more items in the control and functional units sections. The Functional Units section has been modified: in addition to the traditional arithmetic and logic functional units, special Functional Units and I/O functional units are added to the diagram. One example of a new functional unit, a CAM FU, is included later in this description. The CAM FU may perform pattern recognition, for example, on biological DNA information, fingerprint data searches, or code breaking.

The I/O Functional Units may allow for direct processor interaction through high speed I/O interfaces, including, for example, processor-to-processor InfiniBand links, DWDM SONET and Internet fiber-optic interfaces. The inclusion of high speed interfaces as part of the basic Variables in the architecture may allow the processor to bypass data transport limitations of some existing systems. The high speed I/O interfaces allow for the forming of effective processor teams that cooperate by means other than interacting through shared Main Memory. One advantage of the high speed I/O interfaces is the computing device's ability to filter massive data before storing the relevant part of it in memory.

In this example, a set of Language functional Units (LUs) may be added to the Functional Units section. The Language functional Units are the instruction interpretation units and are responsible for program flow control of the machine; therefore the LUs are present in two different places in the block diagram. Because the LU is a user of a Variable of type program, it is part of the data structure. The same LU is also on the top of the structure, as it is the top control element.

The LU is responsible for converting the machine language instruction stream to internal control signals such as a wide word format in the Dynamic VLIW format discussed herein. In turn, the Dynamic VLIW may control the Mentor circuits, which supervise Bin operations.

Control Flow Overview

As shown in FIG. 13, control in various embodiments may be divided into three levels:

(1) Language functional Unit. The Language functional Unit may perform machine language program parsing and instruction interpretation and convert the machine language program to a dynamic VLIW program. As previously mentioned, present technology allows one to design sophisticated LUs that look ahead at significant machine code sections and thus “understand” complete loops or small routines. This section is not paced by instruction action launch of operands to FUs or instruction completion (the control cycles).

(2) The Dynamic VLIW and flow control. This section is used by all the LUs. The section progresses according to control cycles and is responsible for issuing control commands to the Mentor section (implemented, for example, by way of one or more mentor circuits), crossbars and FUs. The Dynamic VLIW may issue data transfer requests in terms of Variables. The Mentors provide the physical Bin and Memory address, or I/O data in case the Mentor supervises “infinite external operands”. This section is paced by control cycle instruction action launch of operands to FUs and instruction completion.

(3) The Mentor level.

The top two levels should be familiar to an ILP designer, though the terminology used to describe the functions performed in those two levels may be different from the one used here. The next one, the Mentor level, is a new level. The following section details the diagrams, function and use of the Mentor level, a level forming a new architecture innovation.

The Index Loop Circuits

In some embodiments, Index Loop circuits are used to operate loop indices. The Index Loop circuits are placed in the Dynamic VLIW control layer. It is at the discretion of the DONA (the machine language) compiler whether a named program element that is used as a loop index is translated by the HLL to DONA compiler as a Variable or as an Index.

Definition: A named integer operand used as loop indexing control and updated by incrementing or decrementing by a fixed integer until it reaches the prescribed limit is defined as an Index.

Complex manipulation of the named HLL operand used as the loop's index may require the operand to be translated as a Variable.

Using an Index instead of Variables does not violate the principle that “data conceptual units are Variables”, as it is up to the compiler to decide which HLL statements denote use of Variables and which are just “linguistic ways of stating” specific algorithmic actions. For example, DO I=0, N; A(I)<=B(I)*3.5; END DO; is just the HLL's way of saying “multiply array B by 3.5 and store the results in array A”. FORTRAN does not provide a way of stating action upon a plurality of operands without linguistically using indices.

In some embodiments, loop indices that are operated through a starting value, an end value and an increment value are assigned by DONA to Index Loop circuits in the Dynamic VLIW flow control level. The Index circuits may be synchronized with the Instruction Interpretation and the use of the DONA LOOP END instruction such that the termination of the loop does not involve program flow speculation (branch prediction), as the program flow knows ahead of time when the last loop iteration will occur. With history-based branch prediction, in addition to incurring a misprediction recovery delay, the branch prediction mechanism is a common source of out-of-bounds operations. The history-based predictor associated with a Conditional Branch will forecast branch taken on the last loop iteration, an action that may cause an out-of-bounds access to the N+1 element of an N element array. While the out-of-bounds access is indeed recovered by the logic, the N+1 access behavior interferes with the ability to install out-of-bounds checks in the computing device.
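A loop index driven by a starting value, an end value and an increment can report its final iteration before that iteration ends, so no prediction is needed. The following sketch is a software analogy only: the class name `IndexLoop`, its methods and the helper `iterate` are hypothetical, the end value is treated as inclusive (as in DO I=0, N), and the starting value is assumed to lie within the limit.

```python
class IndexLoop:
    """Hypothetical model of a loop index with start, end and increment."""

    def __init__(self, start, end, step):
        if step == 0:
            raise ValueError("increment must be nonzero")
        self.value = start
        self.end = end      # inclusive limit (assumption of this sketch)
        self.step = step

    def is_last(self):
        """True when the current iteration is the final one."""
        nxt = self.value + self.step
        return nxt > self.end if self.step > 0 else nxt < self.end

    def advance(self):
        """Step the index; never move past the prescribed limit."""
        if self.is_last():
            return False
        self.value += self.step
        return True


def iterate(loop):
    """Yield (index value, last-iteration flag) for every iteration."""
    while True:
        last = loop.is_last()
        yield loop.value, last
        if last:
            return
        loop.advance()


trace = list(iterate(IndexLoop(0, 3, 1)))
```

Because the last-iteration flag is known at the start of the final pass, the model never produces an index beyond the limit, in contrast to a mispredicted conditional branch.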

Whether originating in an Index circuit or a Variable, all loop index values may be broadcast by the Dynamic VLIW layer to all Mentors. Only the present value, the initiation and the termination of the innermost loop may be broadcast at loop initiation. The Mentors may retain all relevant (outer) loop information previously provided.

The Mentor Circuit as the “Addressing Processor”

In some embodiments, a processor includes N Mentor circuits; the number for N in this example is 48. Each Mentor circuit is responsible for handling all the transactions relating to one or more Variables assigned to it, where each Variable resides in the Bin (and associated memory, communications and instrumentation interconnects) that is assigned to the Mentor. Each Bin may handle one or more Variables.

The program flow activities of the data structure are directed by the Dynamic VLIW. The Dynamic VLIW is the internal intermediate multiple-activities-per-cycle instruction format. The Dynamic VLIW issues instructions in terms of logical Variable IDs. It is up to the Mentor circuits to translate Variable IDs to Frames addresses and physical locations in Main Memory (or communications or instrumentation operations).

The control activity done by the Dynamic VLIW, for example in the operation A (i)<=B (i)*C (i), is done in terms of the logical names of A, B and C and the index value of loop iteration “i” made available to the Mentor circuits owning the Variables A, B and C.

The tasks of the Mentor circuits include, but are not limited to, doing the following when given the logical names “A”, “B” and “C” and the value of “i”:

1. Provide to the data structure the physical Frame addresses of the Bin operands A, B and C.
2. If any of the operands of B and C are not in the Bins, bring the operands from memory.
3. Perform a write through of A from Bin to memory when appropriate.
4. Participate in word/block alignment if operands arriving in blocks of 8 words are not properly aligned with the desired word location of the result.
5. Support speculative execution when appropriate by keeping a record of past operand values and/or holding back write through to memory.
6. Report a bounds check violation if the address requested by the values of the indices (“i” in this example) is outside the Variable's range.
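Several of these tasks (address translation, bringing missing operands from memory, write through, and bounds reporting) can be sketched in software for the operation A (i)<=B (i)*C (i). The sketch below is a loose analogy with assumptions stated plainly: the class `Mentor`, the exception `BoundsViolation`, the dictionary used as a Bin, and the flat list used as Main Memory are all hypothetical stand-ins, and alignment and speculation are left out.

```python
class BoundsViolation(Exception):
    """Raised when an index falls outside the Variable's range."""


class Mentor:
    """Hypothetical software stand-in for a Mentor owning one array Variable."""

    def __init__(self, name, base, length, memory):
        self.name, self.base, self.length = name, base, length
        self.memory = memory   # stand-in for Main Memory (a flat list)
        self.bin = {}          # stand-in for the Bin: index -> operand
        self.fetches = 0       # counts memory-to-Bin transfers

    def address(self, i):
        """Translate a logical index to a physical address, with bounds check."""
        if not 0 <= i < self.length:
            raise BoundsViolation(f"{self.name}({i}) is outside 0..{self.length - 1}")
        return self.base + i

    def read(self, i):
        addr = self.address(i)
        if i not in self.bin:              # operand not in the Bin: bring it in
            self.bin[i] = self.memory[addr]
            self.fetches += 1
        return self.bin[i]

    def write(self, i, value):
        self.address(i)                    # bounds check before accepting data
        self.bin[i] = value

    def write_through(self):
        """Copy the Bin contents back to Main Memory."""
        for i, value in self.bin.items():
            self.memory[self.base + i] = value


memory = [0, 0, 0, 0, 1, 2, 3, 4, 10, 20, 30, 40]
A = Mentor("A", 0, 4, memory)
B = Mentor("B", 4, 4, memory)
C = Mentor("C", 8, 4, memory)
for i in range(4):                         # main code names only A, B, C and i
    A.write(i, B.read(i) * C.read(i))
A.write_through()
```

The main loop refers only to logical names and the index value; every physical address stays inside the Mentor model.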

Access to the Variable's assigned data space is granted through the Mentor responsible for this Variable's Bin. The Variable's assigned space includes, but is not limited to, both the Main Memory space and the Bin space. In the command issued by the Dynamic VLIW to the Mentors, the Variable's logical ID may be followed by a list of index-relative formulas.

Consider the expression:

DO i=2, M; DO j=2, K;

A1 (i,j)<=0.25*(A (i,j−1)+A (i−1,j)+A (i+1,j)+A (i,j+1));

END DO; END DO;

The Dynamic VLIW program flow control level will transfer to the Mentors the index-relative formulas, to the equivalent of:

A1 <=0.25*(A (,−1)+A (−1,)+A (+1,)+A (,+1));

The four formulas [,−1]; [−1,]; [+1,]; [,+1] specify the index values relative to the inner loop (i) and the next higher level loop (j). The formulas are unchanged through the two level DO loops run. Given the formulas and the current values of the loop indices, it is up to the Mentor circuit to calculate and produce the correct physical operand addresses during each program cycle. Note that A1 does not require a formula, as in the absence of an explicit formula the current values of the indices (i,j) are used. The use of (loop) index-relative notation also makes for compact (DONA) machine code and simple correlation of object code to source code.
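One way to picture index-relative formulas is as fixed offset pairs that are combined with the current loop indices each cycle. The sketch below rests on assumptions that are not from the text: a zero-based, row-major layout, a hypothetical function `operand_address`, and an invented base address of 100 for a 4 by 5 array.

```python
# The four formulas [,-1]; [-1,]; [+1,]; [,+1] as (i offset, j offset) pairs.
FORMULAS = [(0, -1), (-1, 0), (1, 0), (0, 1)]


def operand_address(base, dims, indices, formula=(0, 0)):
    """Physical address of one operand of a two-dimensional Variable.

    Assumes zero-based, row-major storage (an assumption of this sketch).
    With no explicit formula the current index values are used unchanged.
    """
    rows, cols = dims
    i = indices[0] + formula[0]
    j = indices[1] + formula[1]
    if not (0 <= i < rows and 0 <= j < cols):
        raise IndexError(f"({i},{j}) is outside a {rows}x{cols} Variable")
    return base + i * cols + j


# Hypothetical array A: 4 rows by 5 columns starting at address 100.
current = (2, 3)
neighbors = [operand_address(100, (4, 5), current, f) for f in FORMULAS]
center = operand_address(100, (4, 5), current)
```

The formulas stay constant for the whole loop nest; only `current` changes from cycle to cycle.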

Computing operand locations for a multi-dimensioned Variable will typically involve additions and multiplications, so the computation may take several cycles. Note, however, that the operations once performed will typically produce a value that requires only an increment (or decrement) by a fixed number in the following cycles. Thus the operand address computation fits the pipelined processor control cycle method of operation. The first control cycle in a loop may take several cycles, but the following iterations may be done in a single cycle.
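The point that a full address computation is needed only once, followed by fixed increments, can be checked with a short model. The function names and the row-major layout are assumptions made for this sketch, not details taken from the text.

```python
def full_address(base, cols, i, j):
    """Full computation: one multiplication and two additions."""
    return base + i * cols + j


def address_stream(base, dims, start, axis, count):
    """Addresses for `count` iterations that step one index by 1.

    Only the first address uses the full computation; every later address
    is the previous one plus a fixed stride (row-major layout assumed).
    """
    rows, cols = dims
    stride = cols if axis == 0 else 1
    addr = full_address(base, cols, start[0], start[1])  # setup cycle(s)
    out = []
    for _ in range(count):
        out.append(addr)
        addr += stride                                    # single add per cycle
    return out


# Stepping i from 0 to 3 with j fixed at 3 in a hypothetical 4x5 array at 100.
stream = address_stream(100, (4, 5), (0, 3), axis=0, count=4)
```

Comparing `stream` with four separate calls to `full_address` gives identical values, which is why steady-state iterations need only an adder.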

As a result of assigning the index arithmetic calculation to the Mentors and providing to the Mentor circuits the state of the loop iteration circuits through the Dynamic VLIW control circuit, just about all indexing calculations are offloaded from the main code to become tasks for the addressing circuits in the Mentors. The Mentor and loop index tasks are done outside the data structure and in parallel with the main code. Compared to the instruction activity in a traditional (ILP) register machine, typically two thirds of the instruction activities, specifically the “where to find [the operands]” (Backus) part, may be offloaded from the main program flow to the Index Loop circuits and Mentor circuits.

As a result, the main program control is now mainly about “traffic concerns . . . significant data itself”. The “where to find it” part of the program's work is now done in parallel by the Index circuits and by the Mentor circuits.

Mentor circuit, Tasks and Functions

A Variable is managed by a logical Mentor through a physical Mentor circuit operating on behalf of the Variable's logical ID. A physical Mentor circuit manages a limited number of logical Variables. In the example presented here, a Mentor may handle a single dimensioned Variable or up to 64 single Variables and/or up to 1024 constants. The number of Variables handled by a Mentor circuit may be increased by increasing the design complexity of the circuit. The spare parts approach advocates that the number of Mentor circuits should be 8 or more. In this example, the processor includes 48 physical Mentor circuits.

Variables are operated through Mentor/Bin pairs, where the Bin contains the Variable's immediate data and the Mentor contains the Variable's descriptor and addressing formulas and controls the actions involving the Variable. For dimensioned Variables the Variable's full data set is typically stored in Main Memory and the Bin may contain all or a part of the Variable's data set; thus the Bin is acting as a “cache” for the currently used portion of the Variable's data.

FIG. 14 is a functional diagram of a physical Mentor circuit. The diagram relates to the different functions performed by the logic circuits inside the Mentor; FIG. 14 is not a hierarchical diagram of control.

Starting at the bottom of FIG. 14 one can see three sections. The central section is a hardware addressing circuit responsible for generating all the Main Memory addresses and all Bin addresses for accessing individual elements of Variables. For complex addressing expressions, as in an inversion layer in a database, the Mentor may also receive information from the data structure section.

At the left bottom section one sees the Variable's self-bounds check. This circuit performs bounds checks on all main memory and Bin addresses generated by the Mentor. The Mentors perform those checks simultaneously with sending the addresses, thus no delays are incurred due to bounds checks. The design contains a pipeline halt mechanism so the action is stopped prior to consummation (memory write through) in the very rare cases in which a bounds check violation flag is raised.

Bottom right is the Intrusion bounds check. This circuit monitors all four memory address buses for cases in which someone else (another Mentor, or another processor in a shared memory system) violates the Variable's assigned space. This circuit, for example, protects against allocation errors and malware. An allocation error may occur when a new Variable is allocated memory space while the system forgot to deallocate that space from an existing Variable and/or forgot to remove the old Variable from the system.
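The intrusion check amounts to comparing every address seen on the memory address buses against the Variable's assigned space and flagging any access that did not come from the owning Mentor. The sketch below is illustrative only: the function `intrusions`, the representation of bus traffic as (source ID, address) pairs, and the half-open address range are all assumptions.

```python
def intrusions(owner_id, space, bus_traffic):
    """Return accesses by other sources that land in the owner's space.

    `space` is a half-open range (low, high) of addresses assigned to the
    Variable; `bus_traffic` lists the (source id, address) pairs observed
    on the memory address buses in one cycle. Hypothetical model only.
    """
    low, high = space
    return [
        (source, address)
        for source, address in bus_traffic
        if source != owner_id and low <= address < high
    ]


# One cycle of traffic on four address buses (invented values).
cycle = [(1, 0x100), (2, 0x104), (2, 0x200), (3, 0x0FF)]
flagged = intrusions(owner_id=1, space=(0x100, 0x110), bus_traffic=cycle)
```

In this model the owner's own access at 0x100 is not flagged, while source 2 touching 0x104 is reported.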

Going now up in FIG. 14, the conflict resolution logic is responsible for obtaining Memory port allocation for Bin to memory and memory to Bin data transfers.

This part of the Mentor circuit is also responsible for negotiations and staging when more than one Mentor attempts to access the same physical Frame block for Functional Units operand transfers, as well as when an extra cycle is needed for staging and block alignment. As was described in previous parts, the Mentors and allocation logic use the control cycle discipline to resolve Frame contention. Once multiple requests in the same cycle for the same Frame are present, one Mentor gets access to the Frame, forming a pre-staged transfer to a tarmac or a memory port.

In the following cycle both the first and the second Mentor channel requests perform a transfer. The process continues until the entire set of requests in the control cycle has been honored. As previously explained, once the pre-staging is done during the first loop iteration, the following loop iterations take only one cycle per control cycle, as the operand flow is now pre-staged. The competing Frame requests may come from the same Mentor, as each Mentor may issue 3 channel requests per cycle (two Bin-to-FUs and one FUs-to-Bin), or they may come from different Mentors.

The next level up in FIG. 14 is where, for dimensioned Variables, the Variable's “cache management” resides. Specifically, this function contains the table and controls for which sections of the dimensioned Variable are placed in the Bin, and contains the logic to handle incoming sections (read) and outgoing sections (memory-write-through) as well as traffic in and out from the Bin to/from the Functional Units. (Note: a program code section or a variable length list or string is considered a dimensioned Variable.)

For single Variables this section contains the pointers for the 16-element circular file that retains the Variable's history.

One will also find in this section the logic for “cache coherency” support for dimensioned and single Variables. As stated, the Bin is the “private cache” of each Variable.

The next section up in the FIG. 14 diagram is “History and logic for Undo”. This section is concerned with all the elements of speculative execution (branch prediction, out of order execution) and with the use of “undo”, which participates in debugging support. Speculative execution requires that the processor logic retain full context to undo operations in case the speculative assertions fail. The system may also be designed to provide undo capability beyond what is needed by the speculative actions in the hardware, specifically to allow for undo back to checkpoints defined by software. Please note that the processor does not deploy shadow registers; undo operand storage is supplied by the (Dynamic VLIW and Mentor programmable) use of the Bins. This includes using circular files for single word Variables and holding back the Variables' memory-write-through until all speculative assertions have been verified. Thus, due to the programmable nature of the undo capability, it may be extended beyond present hardware needs of supporting speculative operations.
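A bounded history file with status save, undo and confirm operations can be modeled as follows. This is a sketch under assumptions: the class `SingleVariable`, its method names and the use of sequence numbers are hypothetical, and the 16-entry depth follows the circular file size given for single Variables.

```python
from collections import deque


class SingleVariable:
    """Hypothetical model of a single Variable with a circular history file."""

    def __init__(self, value, depth=16):
        # Each entry is (sequence number, value); the oldest entry is
        # dropped automatically once `depth` entries are held.
        self.history = deque([(0, value)], maxlen=depth)
        self.seq = 0
        self.checkpoints = []          # sequence numbers of status saves

    @property
    def value(self):
        return self.history[-1][1]

    def write(self, value):
        self.seq += 1
        self.history.append((self.seq, value))

    def status_save(self):
        """Mark the current state as a point that undo can return to."""
        self.checkpoints.append(self.seq)

    def undo(self):
        """Discard every write made after the most recent status save."""
        target = self.checkpoints.pop()
        if self.history[0][0] > target:
            raise RuntimeError("checkpoint value is no longer in the history file")
        while self.history[-1][0] > target:
            self.history.pop()
        self.seq = target

    def confirm(self):
        """Speculation verified: the most recent status save is released."""
        self.checkpoints.pop()


v = SingleVariable(1)
v.write(2)
v.status_save()        # e.g. before a predicted conditional branch
v.write(3)
v.write(4)
v.undo()               # prediction failed: back to the saved state
```

The model also shows the limit of a fixed-depth file: after more writes than the file can hold, the saved value is gone and undo has to fail rather than return stale data.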

The top part in FIG. 14 deals with the general Mentor control and the Variable's definition parameters. A Mentor's ownership of a Variable may start when the Mentor receives from the DONA Instruction Interpreter the Variable's definition. The Mentor contacts the Resource allocation circuit in the Dynamic VLIW layer to receive the Variable's Bin space allocation.

The Resource allocation circuit may be contacted by the assigned Mentor at the initiation of a routine to request Bin space allocation in the Frames structure for Variables in this routine. Shared spaces or transfer Variables (CALL by name) already have assigned space. Operands in CALL by value are assigned as constants in a Mentor/Bin assigned to the new routine. Other, more sophisticated CALL/RETURN techniques, like Inheritance and Polymorphism properties in object-oriented C++, may be incorporated in the Mentor and Dynamic VLIW control design as decided by an implementation system architect.

The Variable definition, among other things, contains Mentor ID, Variable type, size, dimension, the Variable's location in main memory, and Bin allocation. The Bin may be uninitialized or initialized by its previous use. A discussion of the use of the Mentors as limited physical resources to support a large number of virtual Variables follows in the Virtual Mentor Files section.

Logical Mentor

As used herein, a Logical Mentor is a logical entity responsible for the integrity, addressing operations and/or other functions of Variable(s) assigned to the Logical Mentor, whether it is assigned to a physical Mentor, to a Virtual Mentor File (VMF) or to other software and/or hardware mechanisms included in the architecture. The Logical Mentor construct includes the existence of physical and logical mechanisms responsible for the creation of a Logical Mentor and the conversions between Mentor VMF state and a logical Mentor using a physical Mentor. As techniques like object-oriented Inheritance and Polymorphism or Data Flow programming are incorporated into the computing device architecture, the logical Mentor design may include the appropriate support for those capabilities. The physical Mentor used by a logical Mentor may be similar to the Mentor circuit described here, or it may use a different implementation including portions in hardware and other portions in software. The logical Mentor mechanism as described herein may also include a “conscious dormant” state capability through allocated active hardware or software, where for example address bounds checks or other mechanisms may automatically render a Variable active, including the assignment of a physical Mentor to the Variable.

Conceptual relations between physical and logical Mentors are similar to the conceptual relations of virtual memory pages. In the “bottom up” physical view a page is in dynamic memory or it is on the hard drive (or both). However, in the top down logical view a logical page is in the memory system, and the virtual mechanism, however it is physically implemented, is responsible for the operational details. The following discussion, for reasons of full disclosure, takes the bottom up view, assuming the same information may be converted to the top down logical view by a person skilled in the art. Logical Mentor as defined herein includes the existence of the full hardware and software mechanisms to implement the logical Mentor, whether they are similar to the physical method described herein or totally different.

The Mentor Circuit Interconnects

FIG. 15 views the Mentor circuit from the interconnects and functional block diagram point of view. An important aspect to note is that in terms of sophistication and complexity a Mentor circuit may rival the complexity of many processors, in the same manner that a single jet engine of a modern airliner exceeds the complexity of many propeller airplanes. The task objective of a Mentor circuit is to assume full responsibility for the operation of one or more Variables. Specifically, the task of the Mentor circuit is to do all the Backus “where to find it (and protect it)” tasks of array index calculations, bounds checking included in the mapping of a logical Variable to a physical Variable operand address, cache management, and other to-be-defined tasks relating to security, simplicity and system robustness.

Mentor Circuit Items

Starting at the top of FIG. 15, the Mentor circuit includes 4 Intrusion bounds check circuits, one per memory address bus. While each Mentor circuit performs full bounds self-checks for the conformance of its generated memory addresses with the space it has been assigned, space intrusions by malware or errors may still occur in the system. For example, space may have been allocated to a new task or Variable without being de-allocated from a current task or Variable. The Intrusion bounds check circuits check each cycle for “is anyone operating, by read or write, in my assigned space”. By the Variable data definition, all data access operations must go through the Mentor circuit assigned to the Variable owning the space. This includes shared spaces, where the Mentors assigned to a shared space know whether the space intrusions are or are not part of the (token passing, etc.) legal procedures for shared spaces. The 4 Intrusion bounds check circuits detect space intrusion violations caused by someone else in the system.

Moving to the left top side of the central box in FIG. 15 leads to the Read and Write Bin to/from FUs Commands. The input segments from the left top include 6 logical Variable commands: up to 4 Variable read to FU commands and up to 2 FU to Variable write commands. The Variable's logical ID is broadcast according to the format in FIG. 16. Each Mentor examines the logical Variable IDs in the commands and only accepts commands that match its assigned Variables' logical IDs.

Each Mentor circuit may simultaneously check each cycle all six data transfer commands issued by the Dynamic VLIW for a match to the Variable IDs the Mentor is handling. As a practical consideration for this design, the Dynamic VLIW may not issue more than two read operations and one write operation per Mentor per cycle. (In the rare case that this is a limitation, a second physical Mentor may be assigned as a “helper”.)

The next input is an 8-bit thread ID. When commands are issued in terms of logical Variables, the namespace of the logical Variables belongs to a program thread. The full logical address contains the thread ID and a logical Variable ID within the thread. A context switch may assign the processor to a different thread, which will use either newly assigned Mentors or reactivate Mentors previously assigned to the now reactivated thread. This conscious-dormant technique enables an inactive thread owning one or more Mentors (dedicated Mentors that are not available to the current thread) to keep a watch over its address domain and become active within a few cycles, as the dedicated Mentors hold the thread's operating context. The activation may be due to a CALL, a RETURN (to), or the detection of an address space violation of the dormant thread space by its dedicated Mentor.

Going down on the left side one finds Literal (Address) and Bin Setup. This input in FIG. 15 comes from the Dynamic VLIW literal. It is used in generating a memory address when, for example, the program contains a fixed memory address to post a computational result in the addressed location, or an address to jump to in case of a program completion or error. This field is also used in order to set up the Mentor circuit with the addressing formulas.

Program Loop Index Value and Status

The next item going down on the left side of FIG. 15, broadcast from the Dynamic VLIW Program Flow Control (FIG. 13) to all the Mentors, is the status of the current index value and the loop termination status, including a LOOP END, a BRANCH to a location outside the loop's area, or an exception. The Mentors retain all (outer loop) parameters obtained prior to inner loop entry, as they do not change as long as the innermost loop is active. As stated, active loop indices are typically used as operands in indexing formulas.

Pointer Status Save and Undo Commands

The following input in FIG. 15 is a set of control lines from the speculative execution portion of the Instruction Interpretation circuit requesting that the history stack perform a “program status save” and retain the Bin index values so that the Dynamic VLIW program may return to this location. A status save instruction is given, for example, upon any speculative execution, including the program going through a CONDITIONAL BRANCH based on branch prediction. The status save mechanism is also used by the software debugging tools for Undo operations that go beyond the status save range required by speculative out of order execution or branch prediction.

Continual status updates are also received by the Mentors from the Dynamic VLIW control when speculative program flow status has been confirmed, for example after the branch prediction test was done and the results agree with the prediction, indicating that the information is no longer subject to “undo” and that Bin content, for example, may now be deposited in Main Memory (write-through). The confirmation of branch prediction or other speculative operation is necessary prior to the Mentor initiating any information write-through to memory or data overwrites, as a write-through to memory is not recoverable. “Status save” and “Undo” commands are broadcast to all Mentor circuits by the Dynamic VLIW Program Flow Control.

Computed Address

This input to the Mentor circuit in FIG. 15 is an operand coming from a single or dimensioned Variable such as an array, table, queue or list residing in a Bin. The operands are computed by algorithms that typically use addressing formulas that are more complicated than the range of computations done by Mentor addressing formulas. A case in point would be addressing formulas that themselves require access to operands in Variables in the same or other Bins. An example would be the computing of the inverted index into the data files of a database. Where the scope of Mentor addressing functions ends and the use of computed addresses begins may also depend on the type of mechanisms that are included as standard in the hardware and (DONA) machine language of the computing device described herein, regarding the inclusion of new mechanisms and existing system software mechanisms present, for example, in C++, Java and Visual C++ concerning Class, Encapsulation, Inheritance, Data Base inverted index files, etc.

The computed address is fetched from a Bin and is put on bus “0” word “0” to feed the Mentor's Computed Address bus, similar to sending an operand to a FU using the leg input ID of the FU. The Computed Address bus is broadcast to all the Mentors for computed address input.

Access Grant to Memory

The information at this input comes from the Memory bus arbitration circuit. Mentor ID plus Memory port grant information is transferred from the Memory bus arbitration circuit in the Dynamic VLIW Program Flow Control. The information includes the Memory port assigned to the Mentor and the number of Memory cycles allocated (1, N, unlimited). The Mentor is to proceed with the Memory port transfer during the following cycles.

Virtual Mentor File (VMF) Load/Unload

The last entry on the left side of the diagram is a path that enables the creation and use of Virtual Mentor Files (VMF). VMFs are the basis for a virtual Mentor structure. To activate a Mentor, the Dynamic VLIW requests from the Dynamic VLIW Program Flow Control resource arbitration section a Mentor assignment as well as the assignment of Bin space based on the Variable Data Definition section in the VLIW program (and/or on other means, for example polymorphism and self-defining data, when those are included). The Mentor and Bin space are returned to the resources allocator at program termination.

The resources allocation circuit may exhaust either Mentors or Bin space. In this case some of the current Mentor+Bin spaces are turned virtual. The area in the Bin that needs to be transferred to Memory (write-through) is transferred to Memory, and the content of the Mentor (Variable's base address, Variable type (Class), dimensions, etc.) is turned into a Virtual Mentor File (VMF); see FIGS. 17 and 18. A Dedicated Mentor is assigned to the VMFs, as the VMF set is also a Variable. As described above, the Mentor content may be mapped to VMF format if the resource allocator circuit turns a Mentor assignment dormant in order to assign the physical Mentor circuit and/or Bin space to another task. The VMF is typically turned back from dormant to active if a program attempts to use the Variable. In that case the resources allocator may turn other Mentors to VMF status. The resource allocator may use Least Recently Used or another algorithm to decide which Mentors are active and which are dormant. The VMF operation and the VMF Mentor are the subject of the Virtual Mentor Files (VMF) section that follows.
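The active/dormant exchange can be pictured as a small pool of physical Mentors backed by a store of VMFs, with Least Recently Used as the replacement choice named in the text. The sketch below is a software analogy under assumptions: the class `ResourceAllocator`, its `use` method and the descriptor dictionaries are hypothetical, and Bin write-through is not modeled.

```python
from collections import OrderedDict


class ResourceAllocator:
    """Hypothetical model: a few physical Mentors backed by dormant VMFs."""

    def __init__(self, physical_mentors):
        self.capacity = physical_mentors
        self.active = OrderedDict()   # variable id -> descriptor, LRU first
        self.vmf = {}                 # dormant descriptors (Virtual Mentor Files)

    def use(self, var_id, descriptor=None):
        """Return the descriptor for var_id, activating it if needed."""
        if var_id in self.active:
            self.active.move_to_end(var_id)      # mark as most recently used
            return self.active[var_id]
        if var_id in self.vmf:
            descriptor = self.vmf.pop(var_id)    # dormant -> active
        elif descriptor is None:
            raise KeyError(var_id)               # unknown Variable
        if len(self.active) >= self.capacity:
            victim, saved = self.active.popitem(last=False)
            self.vmf[victim] = saved             # least recently used -> VMF
        self.active[var_id] = descriptor
        return descriptor


alloc = ResourceAllocator(physical_mentors=2)
alloc.use("A", {"base": 0, "dims": (4,)})
alloc.use("B", {"base": 4, "dims": (4,)})
alloc.use("A")                                   # A is now most recently used
alloc.use("C", {"base": 8, "dims": (4,)})        # B turns dormant
alloc.use("B")                                   # B reactivated, A turns dormant
```

A program that touches a dormant Variable gets the same descriptor back; only the assignment of physical resources has changed in between.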

To Memory Port Control

Starting at the top right corner of FIG. 15:

Request Tags. On the right side of the figure, starting at the top, one sees the memory access request tags used by the Mentor to gain control over one or two of the four Main Memory ports. The request tags are sent to the Memory bus allocation circuit. There are six “request tag buses”, one for each of the six transfer channel requests.

When the Mentor decides it needs Memory access it may post, during the cycle after receiving a command, a request on a tag bus corresponding to the original channel operand transfer request. The allocation circuit knows the ID of the requesting Mentor by keeping track of the command issued one cycle earlier. The request also identifies whether this is a regular or a Semaphore request (shared space) and whether it is for a word, a bank (8 words), multiple banks, a byte stream, etc.

To Frame Structure

The following sets of signals contain the information corresponding to the Dynamic VLIW channel transfer commands of FIG. 16. While the channel commands (FIG. 16) are given by the Dynamic VLIW in terms of logical Mentor/Bin IDs, the information provided by the Mentors to the Memory buses, Frame addressing and crossbar controls is in terms of physical Memory addressing, Frame addressing and crossbar control information. The Mentor translates the signals from the Language Unit+Dynamic VLIW control structure, which operates in a logical Variable space, to the physical space of the memory and data structure.

Bin READ/WRITE Word Masks and Addresses

These signals select the Frame blocks and provide both the addresses within the blocks and the Read-Write control signals to the Frames for all memory and FU information transfer. A single Mentor may select up to five Frame blocks (8 words each) in a single cycle: two for Bins to FUs, one for FUs to Bins and two for communicating with memory.

As previously discussed, Frame access conflicts may occur. The pipeline control circuit uses tarmac registers and the control cycle method in order to resolve the “collision of requests”. The pipeline control cycle circuit blocks the pipeline advances of the FU pipeline and the Dynamic VLIW while it takes extra cycles to “pre stage” tarmac registers and resolve collision conflicts. The “pre staged” configuration is retained throughout the loop duration, such that the “pre staging” penalty is typically paid only once, at the onset of a loop operation.

To Operand Crossbars

Routing Controls for Bins to FUs

These signals provide the destination FU bus address for the crossbar control. Note that in this implementation a Mentor may, in each cycle, issue commands on a maximum of two out of the four crossbar control buses.

To Results Crossbars

Routing Controls for FUs to Bins

These signals provide the source FU bus address for the crossbar control. In this implementation a Mentor may, in each cycle, issue commands on a maximum of one out of the two crossbar control buses.

To Bins To/From Memory Crossbars

Routing Controls for Bins to/from Memory

These signals provide controls for the Memory to/from Frames crossbar interconnect. A single Mentor may be assigned up to two out of the four Memory ports in this implementation.

Memory Addresses.

The last two outputs provide the Main Memory addresses. Each address bus may be used in a read or a write operation.

Circuits Inside the Mentor

In this description we detail only the major blocks of circuits inside the Mentor, which is the central box in FIG. 15. Implementation may vary based on design decisions, influenced by pipelining and cycle time, about exactly which parts of tasks are done by the Mentors and which parts are done by the Dynamic VLIW Program Flow Control, the bus arbitration circuit(s) and the resource allocation circuit of the Dynamic VLIW Program Flow Control (FIG. 13).

The choices are in large part dictated by the architecture features that are basic to the computing device hardware/software interface design approach and by tested concepts and ideas in language interpreters like Java, C++, etc. Re-inventing a hardware/software interface afresh is impractical as it takes too long.

The hardware/software interface choices include the basic namespace approach: mapped or physical location based. That is, either “Variable name=Variable base location in Memory” or the Variable ID is a mapped namespace. The choices also include the selection of the function/feature set one wants to match. (This is computer talk for “the rules and laws of the computing device's legal system”.)

If, for example, one chooses C++ as the software transport route to move away from Register based machine language to a machine language using information processing's “conceptual units”, one should at least start with a proven system already operating in the “conceptual units language space” and include in the software/hardware interface (data definitions, instruction definitions and all their intricate relations) a complete subset of the instruction types and data structures in all their modes of operation, in a fashion that matches the object, class, abstraction, encapsulation, inheritance, polymorphism, etc. of C++. One may add features/functions (VMF, IMMUTABLE, etc.), but subtraction of features is typically problematic.

The match may be “bit compatible” or functionally compatible. If the architecture choice for the computing device is to be “bit compatible”, existing code and data may be used “AS-IS”. If the choice is for “functional compatibility”, code and/or data may need to be transliterated from the existing forms to new forms. (If Greece decides to adopt the US legal system “AS-IS”, all Greek citizens had better speak English. If Greece decides to functionally adopt the US legal system, US laws, precedent cases, etc. must be fully translated to Greek.)

Implementation of Low Use and Future Innovation Function/Feature in the Mentors.

Mentors have direct access to Memory, both their assigned space and other spaces (as long as the Mentor assigned to those spaces does not object). This allows the computing device architect to use Memory and/or associated Dynamic VLIW buffer space to implement low use functions, as well as proposed new features/functions, by (existing recompiled) software and microcode instead of by hardwired implementation means.

This capability is very important for software migration, as the compatibility route taken may include elements that clearly need to change (for example, the use of multiple threads for implementing micro parallel functions in C/C++ is very inefficient) but whose functions are needed in the transition period.

This capability also addresses newly invented features/functions in security and other areas that should be fully field tested, despite their sluggish original software/microcode operation, prior to being installed into the hardware.

For example, consider that an encryption method is proposed for Variable ID and the Mentor is to do part of the encryption/decryption. The system may first be tested with the Mentors using a “sneak memory path” to let an encryption/decryption software program do the actual work, while to the rest of the system it looks like the Mentors are doing the task.

The Circuits Inside the Mentor

FIG. 15, from the top down.

Bounds Comparator(s)

The Memory address bounds comparators contain the bound intrusion detection mechanisms. The circuits check for Memory addresses that are generated by other Mentors and intrude into the domain assigned to the Mentor's Variable(s). For efficient operation, single Variables should be clustered into a contiguous memory space and served by a single Mentor.

Memory Read Pointers, Memory Write Pointers

The Memory Read and Memory Write pointers locate the next block for either a read from Memory or a write-through into Memory. During Memory transfer activities, the pointers are checked against the Base and Limit Registers (or other more complex mechanisms if the Variable is in a distributed structure, as may occur in cloud storage) to verify that Memory operations do not reach beyond the Variable's bounds. This is the self-check done by the Mentor.

Single Variable Pointer

Single Variables, each stored in a single word in Memory, are retained in a 16-element circular file, so that the file retains the history of past values for Undo operations during speculative execution. The past history retention is also an important element in debugging. A Mentor may serve up to 32 single Variables and 512 constants.
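By way of a non-limiting illustration, the 16-element circular file may be modeled in software as follows. The sketch is in Python; the class name `SingleVariableFile`, its methods and the `valid` counter are hypothetical and are not taken from the figures. It shows only how a fixed-depth circular file can retain past values and step back for an Undo.

```python
class SingleVariableFile:
    """Toy model of a 16-element circular history file for one single Variable."""

    DEPTH = 16  # depth stated in the description

    def __init__(self, initial=0):
        self.slots = [initial] * self.DEPTH
        self.head = 0    # slot holding the current value
        self.valid = 1   # number of retained values, including the current one

    def write(self, value):
        # Advance the head and overwrite the oldest slot once the file is full.
        self.head = (self.head + 1) % self.DEPTH
        self.slots[self.head] = value
        self.valid = min(self.valid + 1, self.DEPTH)

    def read(self):
        return self.slots[self.head]

    def undo(self, steps=1):
        # Step back to an earlier value, e.g. after a failed speculation.
        if steps >= self.valid:
            raise ValueError("history does not go back that far")
        self.head = (self.head - steps) % self.DEPTH
        self.valid -= steps
```

With this model, at most 15 writes can be undone at any time, since one of the 16 slots always holds the current value.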

Memory Base and Limit Registers

The registers contain the Variable's base address and the size of the dimensioned Variable (or a more complex mechanism supporting a distributed storage system and chosen new system architecture features).
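As an illustrative sketch only, the two bound checks that rely on the base and size information (the self-check on the Mentor's own pointers and the intrusion check on addresses generated by other Mentors) may be expressed as simple range comparisons. The names `BoundsRegisters`, `self_check` and `intrusion_detected` are hypothetical, and the sketch assumes a Variable occupying one contiguous range of word addresses.

```python
from dataclasses import dataclass


@dataclass
class BoundsRegisters:
    base: int  # Variable's base address in Memory (assumed word address)
    size: int  # size of the Variable (assumed in words)

    def contains(self, address):
        return self.base <= address < self.base + self.size


def self_check(own, address):
    """Self-bounds check: the Mentor's own pointer must stay inside its Variable."""
    return own.contains(address)


def intrusion_detected(own, foreign_address):
    """Intrusion check: an address generated by another Mentor falls in our domain."""
    return own.contains(foreign_address)
```

Both checks use the same comparison; they differ only in whose address is being tested and in what a "true" result means.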

Registers for Multi Dimensioned Variables

The registers contain the base number (usually 0 or 1) and the size of the particular coordinate for all dimensioned Variables, accommodating up to 6D Variables. Higher dimensioned Variables are handled by the compiler as arrays of arrays.

Bin Addresses Formation and Self-Bounds Verification

A set of arithmetic elements including add/subtract units, a multiplier, and logic elements for concatenate and mask operations. The Control cycle mechanism enables complex addressing operations that take longer than a single cycle. The multi cycle operations are set up so that they typically take multiple cycles only in the first loop iteration and a single cycle in the following iterations.
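The following Python sketch illustrates, under stated assumptions, how an element offset might be formed from the per-dimension base numbers and sizes while verifying self-bounds. The function name `bin_offset` is hypothetical, and the row-major ordering is an assumption made for the example; the description does not specify a storage order.

```python
def bin_offset(indices, bases, sizes):
    """Form a word offset for a dimensioned Variable and verify self-bounds.

    indices: index values supplied by the program, one per dimension
    bases:   base number of each dimension (usually 0 or 1)
    sizes:   size of each dimension
    Row-major ordering is assumed for illustration.
    """
    if not (len(indices) == len(bases) == len(sizes)):
        raise ValueError("one index, base and size is needed per dimension")
    if len(sizes) > 6:
        raise ValueError("more than 6 dimensions: handle as arrays of arrays")
    offset = 0
    for index, base, size in zip(indices, bases, sizes):
        zero_based = index - base
        if not 0 <= zero_based < size:
            raise IndexError("index outside the Variable's bounds")
        offset = offset * size + zero_based  # multiply-add per dimension
    return offset
```

For a 4 by 5 array with base 1 in both dimensions, `bin_offset((2, 3), (1, 1), (4, 5))` yields offset 7 under the row-major assumption.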

Mentor Command Format

FIG. 16 contains the format of commands sent to the Mentor by the Dynamic VLIW.

Mentor ID identifies the logical ID of the Variable receiving the command if the following L/P bit is “0”. If the L/P bit is “1” the command addresses a physical Mentor circuit. Physical addressing is used in Mentor set up and in diagnostics.

Channel ID specifies the operand channel involved. Channels 0-3 are Bin to FU (operands) channels; channels 4 and 5 are FU to Bins (results) channels.

FU leg ID identifies the Functional Unit leg involved in the transfer.

Indexing formula identifies one of 16 indexing formulas used to generate the Bin's operand address in this data transfer. Formula “0” is the default case of using loop indices unmodified as Variable indices. The indexing formulas are pre-loaded into the Mentor prior to the command appearing in the Dynamic VLIW.

Tarmac setup refers to three setups for pre staging operand(s) in a tarmac input leg for the duration of a loop. Code 00: no setup. Code 01: apply input word “0” to all tarmac words in an FU block input. Code 10: apply an operand or an input block to the 8 tarmac registers. Code 11 is yet to be defined.

For the reasons behind operand setup, consider expressions like B(I)<=A(I)*3.5 and C(I,J)<=D(I,J)+E(J), where one operand stays constant during the loop computation. Note that the operands 3.5 and E(J), respectively, stay constant through all the loop's iterations. Tarmac setup options allow for setting up a single operand or a bank of 8 operands in the FUs' tarmac registers as the same input value for the duration of the loop.
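As a non-limiting software model of the command fields described above, the Python sketch below records the fields, validates the stated ranges (channels 0-5, 16 indexing formulas) and stages tarmac values according to the setup code. The names `MentorCommand`, `TarmacSetup` and `stage_tarmac` are hypothetical. Field widths and bit positions are not given in this section and are deliberately not modeled, and code 10 is modeled here only for the case of a full 8-word block.

```python
from dataclasses import dataclass
from enum import Enum


class TarmacSetup(Enum):
    NONE = 0b00             # code 00: no setup
    BROADCAST_WORD0 = 0b01  # code 01: input word "0" applied to all tarmac words
    LOAD_BLOCK = 0b10       # code 10: input block applied to the 8 tarmac registers
    UNDEFINED = 0b11        # code 11: yet to be defined


@dataclass(frozen=True)
class MentorCommand:
    mentor_id: int
    physical: bool          # L/P bit: False = logical ID, True = physical Mentor
    channel_id: int
    fu_leg_id: int
    indexing_formula: int
    tarmac: TarmacSetup

    def __post_init__(self):
        if not 0 <= self.channel_id <= 5:
            raise ValueError("channel ID must be 0-5")
        if not 0 <= self.indexing_formula <= 15:
            raise ValueError("indexing formula must be 0-15")

    @property
    def direction(self):
        # Channels 0-3 carry operands, channels 4 and 5 carry results.
        return "bin_to_fu" if self.channel_id <= 3 else "fu_to_bin"


def stage_tarmac(setup, block):
    """Return the 8 staged tarmac values, or None when no setup is requested."""
    if setup is TarmacSetup.NONE:
        return None
    if setup is TarmacSetup.BROADCAST_WORD0:
        return [block[0]] * 8
    if setup is TarmacSetup.LOAD_BLOCK:
        if len(block) != 8:
            raise ValueError("an 8-word block is expected in this sketch")
        return list(block)
    raise ValueError("code 11 is not yet defined")
```

For B(I)<=A(I)*3.5, staging the constant with `stage_tarmac(TarmacSetup.BROADCAST_WORD0, [3.5])` places 3.5 in all eight modeled tarmac registers for the duration of the loop.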

The Virtual Mentor Files (VMF)

The Virtual Mentor Files may overcome the physical constraint of a limited number of hardware Mentor circuits and limited memory space in the Frames.

In some embodiments, a “VMF Mentor” is permanently assigned to handling the virtual Mentor stack, which contains the set of VMFs of dormant Variables. The VMF information (in FIGS. 17 and 18) is used to reactivate dormant Variables.

The VMF Mentor may operate in concert with a special circuit, the Mentor/BIN resource allocator (allocator), residing in the Dynamic VLIW program control level (FIG. 13). The Mentor/BIN resource allocator circuit replaces part of the L1 cache control circuit in current designs. The allocator is responsible for the allocation of Mentor circuit(s) and Bin space(s) in the Frames to new Variable(s) being activated by a program, as well as to Variables that were previously turned dormant by the allocator as part of the (virtual) sharing of the physical Mentor and Frame space resources. The other part of L1 cache management, namely placing information in cache and the cache write-through to Memory, is the task of the Mentors.

When a new Variable is initiated by the VLIW program, a request for a Mentor and Bin space is sent to the Mentor/BIN resource allocator circuit, which grants this request. In case of insufficient space in the Frame structure or an insufficient number of physical Mentors, the Mentor/BIN resource allocator circuit may turn some of the Mentors dormant, based on a least recently used or similar algorithm. The content of the Bins turned dormant is stored in memory and the Mentor information is stored in the VMF file. The VMFs of Variable(s) turned dormant are found in the VMF stack handled by the VMF Mentor. It is a design choice whether the copy or address in the VMF stack entries also includes (additional) runtime information.

The Mentor/BIN resource allocator circuit is similar in function to the hardware circuit(s) used in cache management, where data segments are placed in L1 cache and the complete data is stored in main memory (including L2, L3 and other means). It is also similar in function to the allocation tasks of the software program handling virtual memory, specifically the allocation of dynamic memory space to pages that are permanently stored on a hard drive or other permanent storage means.

The allocation may involve both the allocation of a Mentor circuit and the allocation of the “Bin space” in the Frames. Upon new Variable(s) being activated by a program, both Mentor and Frames space are allocated to the Variable(s). Upon the program's completion (RETURN), both the Mentor circuit and the Bin space are surrendered back to the Mentor/BIN resource allocator circuit.

Not all the Mentors and Bins that are used by a routine are newly allocated Variables that receive their allocation at the commencement of the routine. The shared memory areas (COMMON) that have been established by the program making the CALL (or its ancestry), or Variables passed to the routine through another CALL mechanism, are already operational Variables. (If the information contains CALL parameters passed by value, the CALL parameters are stored in the Bin(s) as constants, but Mentor and Bin space may be allocated.)

An advantage of an LRU allocation method is the quick reuse of the Mentor/Bin pairs. Specifically, when a routine CALL is part of a loop in a program, for example, the routine performs a RETURN only to be called again upon the next iteration of the loop. Reallocation of the identical Mentor/Bin pair(s) saves Mentor set up time, since the Mentor circuits may not need to be reinitialized with the Variable's parameters (Variable's dimensions, size, memory location, Bin location), as they are the same as the last time the Variable(s) were used.

Upon an Instruction Interpreter allocation request for a Mentor and Bin that cannot be fulfilled by the Mentor/BIN resource allocator because all Mentors or Bin space are spoken for, the Least Recently Used Mentor is turned dormant by the Mentor/BIN resource allocator.

Before a Variable is turned dormant, all the corresponding Bin content is placed in memory by a Bin to memory write-through. After the completion of the LRU Mentor memory write-through operation, the VMF of the dormant Mentor is placed in the VMF file.

The Variable is activated again using the VMF when program flow control returns to a program using the dormant Variable(s).
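The sequence described above (an allocation request that cannot be fulfilled, write-through of the Least Recently Used Mentor's Bin content, filing of its VMF, and later reactivation from the VMF) may be sketched in Python as follows. This is a simplified software model only. The class `MentorAllocator` and its attributes are hypothetical, the model tracks only the number of Mentors and not Bin space, and dictionaries stand in for Memory and for the VMF stack.

```python
from collections import OrderedDict


class MentorAllocator:
    """Toy model of a Mentor/BIN resource allocator with LRU eviction."""

    def __init__(self, num_mentors):
        self.num_mentors = num_mentors
        self.active = OrderedDict()  # variable_id -> state, least recently used first
        self.vmf_stack = {}          # variable_id -> saved Mentor parameters (VMF)
        self.memory = {}             # variable_id -> Bin content written through

    def activate(self, variable_id, params=None, bin_content=None):
        if variable_id in self.active:
            self.active.move_to_end(variable_id)  # mark as most recently used
            return self.active[variable_id]
        if len(self.active) >= self.num_mentors:
            self._turn_lru_dormant()
        if variable_id in self.vmf_stack:
            # Dormant Variable: restore the Mentor from its VMF and reload the Bin.
            state = {"params": self.vmf_stack.pop(variable_id),
                     "bin": self.memory[variable_id]}
        else:
            state = {"params": params, "bin": bin_content}
        self.active[variable_id] = state
        return state

    def _turn_lru_dormant(self):
        victim_id, state = self.active.popitem(last=False)
        self.memory[victim_id] = state["bin"]        # write-through comes first
        self.vmf_stack[victim_id] = state["params"]  # then the VMF is filed
```

In this model a Variable that is reactivated receives the same parameters it had when it was turned dormant, which mirrors the set up time saving noted for LRU reuse.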

FIG. 17 provides details for a VMF associated with a dimensioned array:

-   Size: The size of the VMF in 64 bit words.
-   Mentor Type: This selection specifies the Mentor's assigned type based on the type of Variable(s) the Mentor handles, in this case “array”. The example given in FIG. 17 is for a dimensioned array. Other Mentor assigned types (classes in C++, Java, etc.) will handle single Variables and constants (FIG. 18), byte strings, Variables with varied element size, communications channels, etc.
-   Variable ID: The Variable's logical ID in the routine (Variable's name).
-   Dimensions: Single, 1D, 2D, 3D, etc.
-   Mutable: 0X mutable, 10 read only, 11 write only.
-   Bin location: Bin location in the Frame/Bin 262144 word address space.
-   Bin size: The size allocated to the Bin in 64 bit words.
-   Variable Memory address: The Memory base address of the Variable.
-   Variable size: Variable size in Memory.
-   Dimension Base: 00 Base 0, 01 Base 1, 1X the base is defined by Variable Type.
-   Dimension size: The size of each dimension; in a 2D array it is the size of the rows and the columns.
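For illustration, the fields listed above may be carried in a simple record, with the two-bit Mutable and Dimension Base codes decoded as stated. The Python names below (`ArrayVMF`, `decode_mutable`, `decode_dimension_base`) are hypothetical, and the sketch does not reproduce the bit layout of FIG. 17. The only numeric constraint checked is the 262144 word (2^18) Frame/Bin address space.

```python
from dataclasses import dataclass
from typing import List, Optional

BIN_ADDRESS_SPACE = 262144  # words in the Frame/Bin address space (2**18)


def decode_mutable(code):
    """0X = mutable, 10 = read only, 11 = write only."""
    if code & 0b10 == 0:
        return "mutable"
    return "read only" if code == 0b10 else "write only"


def decode_dimension_base(code) -> Optional[int]:
    """00 = base 0, 01 = base 1, 1X = base defined by Variable Type (None here)."""
    if code == 0b00:
        return 0
    if code == 0b01:
        return 1
    return None


@dataclass
class ArrayVMF:
    variable_id: int
    dimensions: int
    mutable_code: int
    bin_location: int
    bin_size: int
    memory_address: int
    variable_size: int
    dimension_base_code: int
    dimension_sizes: List[int]

    def __post_init__(self):
        if not 0 <= self.bin_location < BIN_ADDRESS_SPACE:
            raise ValueError("Bin location outside the Frame/Bin address space")
        if len(self.dimension_sizes) != self.dimensions:
            raise ValueError("one size is expected per dimension")
```

A 2D read only array with base 1 would carry `mutable_code=0b10`, `dimension_base_code=0b01` and two entries in `dimension_sizes`.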

FIG. 18 is the VMF format for a Mentor/Bin assigned to single Variables and constants.

-   Size: The size of the VMF in 64 bit words.
-   Mentor Type: This selection specifies the type of Variable(s) the Mentor handles; in this case the Mentor handles a number of single Variables and constants.
-   # of Variables: Number of single Variables in the block.
-   # of constants: Number of constants in the block.
-   Bin location: Bin location in the Frame/Bin 262144 word address space.
-   Bin size: The size allocated to the Bin in 64 bit words.
-   Variable Block Memory Address: For bound protection, all single Variables and constants are allocated in a contiguous space in memory.
-   Variable ID: Following is the list of IDs of single Variables in this block.
-   Constants Base: The Bin address where the set of constants begins.

Mentor Based Conscious Multithread Capability

A basic operational assumption in operating a processor presently is that all the hardware resources are dedicated to the program now being executed; in the case of Mentor circuits this means that all the Mentor circuits will be assigned to Variables in the current program. Note, however, that Mentor circuits may be assigned differently based on the mission of the system, rather than by looking only to optimize operations for a single program. To illustrate the issue we show the roles that may be assigned to Mentors in a real time system doing both foreground and background tasks. In a system controlling a robot, the foreground tasks concern the robot's motions. The background tasks concern routine checks of all sensors, motors and actuators as well as the power supply, etc., including keeping an accurate log of the history of the maintenance checks.

In some present systems, the system minimizes delay when interrupting the background program if a control or sensor signal issues an interrupt. Taking control away from a background task may leave all its maintenance files in disarray. In those systems, the background task is therefore broken into chunks, each chunk done in a time period that is smaller than the specified “minimum system response time”, after which a supervising program polls for the presence of an interrupt event. This “minimum system response time” may be quite lengthy, as it requires full context switch time, and thus may be too long for some real time systems.

Using the Mentor circuit, the system may maintain multiple program contexts. The foreground program is assigned a set of Mentor circuits and associated Bin spaces in the Frames. The Resource Allocation circuit handles those Mentors as reserved and thus not available from the allocation pool. The background program may now be interrupted in a process that has a reaction time of just a few scores of cycles for flushing the data and control structure pipeline. The foreground thread is turned active in less than 100 cycles and the background thread is turned passive without the loss of any information in the background or foreground programs.

In this approach, the system management OS layer(s) operate similarly to the foreground layer in the real time system above. The approach allows the OS to keep “conscious dormant thread(s)” by allocating permanent Mentor and Bin resources to the OS task(s). A conscious dormant OS thread, for example, may need two reserved Mentors, one for its program and one for its data. Additional Mentors may be assigned when the OS thread is activated. The OS thread activation may be due to a (proper) normal CALL or to a Memory address bounds intrusion (error/exception) detected by the OS active Mentor. Also note that thread activation by encroaching on the OS space does not necessarily signify an error. The application program may simply access the OS space by exception when needing information or OS services. The access causes activation of the conscious dormant OS layer, which services the request and then returns to a dormant state. While this method of thread interaction reduces some of the resources available to the application program, it may, when properly implemented, simplify the OS and significantly increase the overall system efficiency.

In a register machine a dormant thread typically relies solely on time slot assignment (or a hardware interrupt) in order to get control; otherwise the thread has no ability to know anything about the ongoing operation. In the embodiments described herein, however, the Mentors assigned to the conscious dormant thread may be left active, and any attempt, by intent or by means not previously established, to access information in the memory domain(s) assigned to the dormant thread will turn the thread active again. The thread is thus dormant but conscious. Not only is the current program status of the OS control thread preserved, but its full control over its proprietary memory area may be protected throughout the system's operation.
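The behavior of a “conscious dormant thread” may be pictured with the following minimal Python sketch, in which a reserved memory domain is watched and any access by another requester turns the thread active. The class `ConsciousDormantThread` and its methods are hypothetical, and the model abstracts away the Mentor circuits, Bins and pipeline flushing entirely.

```python
class ConsciousDormantThread:
    """Toy model: a dormant thread whose reserved Mentor keeps watching its domain."""

    def __init__(self, name, base, size):
        self.name = name
        self.base = base      # start of the memory domain reserved for the thread
        self.size = size      # size of that domain
        self.state = "dormant"

    def observe(self, requester, address):
        # Called for every memory access seen by the reserved Mentor.
        inside = self.base <= address < self.base + self.size
        if inside and requester != self.name and self.state == "dormant":
            self.state = "active"  # the access itself wakes the thread
        return self.state

    def finish_service(self):
        # After servicing the request the thread returns to the dormant state.
        self.state = "dormant"
```

An application access at an address inside the OS domain wakes the modeled OS thread, whereas accesses elsewhere leave it dormant.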

Conscious multithread capability as described herein is “native” to a computing device with a Mentor control layer and may, in some cases, be applied to conventional register architectures. The technical implementation may be based on any of the possible variations, including a thread change mechanism that employs means such as entering protected spaces, requests for resource allocation changes, calls for specific routines, etc. to trigger the thread change. Similarly, there are many techniques one may include in a register machine architecture, without the use of the Frames/Bins and Mentor technology examples described herein, in order to include the context storage of the “conscious thread(s)” in the processor's architecture. However, such techniques are typically ad hoc and reactive; the introduction of the Mentor circuit helps put structure into the picture.

To conclude the technical description of the Mentor circuit, consider that early on computers were mostly about algorithmic number crunching. Today the emphasis is on data: personal data, “big data”, national security and economic data, etc. The Mentor is an element responsible for aspects including the use, integrity and security of data elements.

The Mentor Roster checks circuit.

The reader may have noticed that the Mentor described herein performs two types of bound checks: self-bounds checks and intrusion detection. While self-bounds checks are typically within the operational logic connectivity of the Mentor, the intrusion detection circuit relies on a physical connection to one or more memory buses. In a PIM type implementation this connectivity is possible, but only to a limited set of processors. In alternate, pin limited embodiments, however, monitoring other buses is not physically practical.

An alternate embodiment methodology to provide intrusion detection and other detection of conflicts is the “roster checks circuit”. The roster checks circuit is based on a fast table lookup, high speed memory device. The roster checks circuit may reside in a special device and contains a memory lookup table and other logic device(s) that perform “security and rights checks” on requests coming from Mentor circuits in several computing devices.

When a Mentor in a system using the roster circuit is activated, it sends its VMF information (for example via a high speed serial InfiniBand link) to the Mentor roster checks circuit, specifying its ownership or shared ownership of a memory space and its possession of other rights and resources. Note that while multiple Mentors are engaged during normal operations, Mentor/Bin pairs are initiated one at a time, so all the Mentors in the computing element may share a single “confirmation and reporting link” to the roster circuit.

The Mentor roster checks circuit, typically using a table lookup method, checks for compliance and/or conflicts. The requesting Mentor may choose to halt operations until a “no conflict” message is confirmed, or may proceed (speculate) assuming a positive confirmation of rights.
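A table lookup check of this kind may be sketched in Python as below. The sketch is a software stand-in under stated assumptions and not a description of the circuit: the class `RosterCheck`, the tuple layout of the table and the rule that two overlapping claims are compatible only if both are declared shared are all assumptions made for the example.

```python
class RosterCheck:
    """Toy model of a roster table that records memory ownership claims."""

    def __init__(self):
        self.table = []  # entries: (mentor_id, base, size, shared)

    def register(self, mentor_id, base, size, shared=False):
        """Check a new claim against the table; record it if no conflict is found."""
        for owner, o_base, o_size, o_shared in self.table:
            overlaps = base < o_base + o_size and o_base < base + size
            # Assumption: overlapping claims are allowed only when both are shared.
            if overlaps and not (shared and o_shared):
                return ("conflict", owner)
        self.table.append((mentor_id, base, size, shared))
        return ("no conflict", None)
```

A Mentor that speculates would continue working after sending its claim and would act on a "conflict" reply only when one arrives.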

Most present operating systems employ a software checkpoint strategy. Since the discovery of violations/conflicts is rare and the cost of recovery is typically the return to the last software checkpoint, the option to continue assuming no roster violation is typically the recommended method of operation.

There is also an advantage to not changing the mode of operation upon discovery of violations. Strange errors tend to be intermittent and only show up under particular conditions, so one wants to keep the operating conditions the same until the error is isolated. Also, sophisticated malware may be programmed to hide when system conditions change.

The Mentor roster checks circuit may assume additional functions; for example, it may include a multiple requestor, multiple server queue mechanism in a multiprocessor system.

The hardware arbitration, roster rights checks and conflict resolution circuit may be several orders of magnitude faster than resolving conflicts or arbitration through methods that involve going through a CALL/RETURN chain of context switches when going through the hierarchical software structure of the OS.

The “current Mentor rights table”, which is the content of the Mentor roster checks circuit, is also a “Variable”. The Variable is set up by the system and therefore belongs to a system thread that is responsible for its set up and maintenance. The fact that this Variable may be changed not only by its owner but also by input activity (of other Mentors) makes the “Mentor roster checks” an Infinite Variable data class/type, which for organizational reasons is connected to the high speed I/O FU of a computing device assigned as a Roster supervisor.

The Rosters may also be designed to check for operations that are characteristic of malware intrusions. For example, not only does the Roster check whether this Mentor has the official rights to access/change bank account information, but it may also ask why the customer is changing bank account information at 3:00 AM local time through a link from Tangier.

The technical challenges in including Roster check circuits in distributed multiprocessor systems include:

-   1. Keep the Roster check circuits as part of the basic model (i.e. it is a system “Variable” and not some hidden gizmo).
-   2. Avoid a single point of failure, so provide two or three (voting) copies.
-   3. Have very fast response time.
-   4. If at all possible, proceed without check confirmation delays and change system operations only upon confirmed violations.

An alternate embodiment of the “Roster checks circuit” is a method of using shared memory that includes a program in shared memory that performs the function of the “Roster checks circuit” described above.

Since Mentors have direct access to shared memory, they may send their assigned Variable space definition and other rights checks to an OS Mentor roster checks routine upon assuming a Mentor assignment. Since rights violations are rare, the Mentors may speculate on “no violations” and rely on a software checkpoint strategy.

The downside of this software approach as compared to the hardware circuit method is that it relies on shared memory and thus may aggravate the “memory wall” problems.

The use of Mentors in Register Machine Emulation

As was previously mentioned, the emulation of a Host machine provides a route for porting Register machine based software. The “Host” mode in the processor may allow the results of two program runs using identical system conditions to be compared, a setup that is very hard to obtain using “side by side machines”: a Host, on the one hand, and systems as described herein, on the other hand.

Assuming that the Host emulator has been properly certified, as would be the case for a new version of the native Host core, comparing program runs that show incompatible results helps isolate whether the code porting problems are in the program, the compiler or the basic model that the program is based on (mathematical, economic, etc.). The runs, for example, may point out problems in the HLL code models. The current code may have “good enough accuracy”; however, the program may fail the bounds checks and other software robustness criteria. Please review the Simple Relaxation Algorithm section for an example.

Returning to FIG. 13, in embodiments described herein, any of the LUs may be assigned as top level control Instruction Interpreter(s). The Instruction Interpreter may convert the machine language code to the intermediate Dynamic VLIW code. The Dynamic VLIW code may then be executed by the Dynamic VLIW control layer in FIG. 13. Now assume the current top box in FIG. 13 is the “Host” Language Unit emulating an existing register RISC “Host” instruction set.

In order to perform the assignment, the “Host LU” requests and receives from the Resource Allocation circuit (whose function, as stated, is similar to the memory page allocation circuit in virtual memory based machines) the assignment of four or five Mentor circuits and four or five corresponding Bins.

One Mentor circuit and the corresponding 512-word Bin (32×16) are assigned to operate in single Variable mode and handle the emulation of the 32 machine registers. If the Host has two sets of registers, one for fixed point and one for floating point, as is the case with the IBM POWER, two Mentor circuits with the two corresponding 512-word Bins are assigned to emulate the two sets of registers.

Next, three additional Mentor+Bin pairs operate in dimensioned Variable mode and are assigned the tasks of instruction cache, data cache and shared data cache, respectively. The three Mentor circuits, when serving those dimensioned Variables, operate on the Variable's Bin content using sixteen 64-word Frames/Bins data sections, and thus each is allocated a 1024-word Bin.
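The Bin sizes quoted above follow from simple arithmetic, which the Python sketch below reproduces. Reading 32×16 as 32 registers each retained in a 16-element history file is an interpretation based on the Single Variable Pointer description; the function name `host_emulation_bins` and the dictionary keys are hypothetical labels.

```python
REGISTERS = 32          # machine registers per register set
HISTORY_DEPTH = 16      # assumed: 16-element circular file per single Variable
SECTIONS = 16           # data sections per cache-like Bin
WORDS_PER_SECTION = 64  # words per data section


def host_emulation_bins(register_sets=1):
    """Return the Bin sizes (in words) requested for Host emulation."""
    register_bin = REGISTERS * HISTORY_DEPTH    # 32 x 16 = 512 words
    cache_bin = SECTIONS * WORDS_PER_SECTION    # 16 x 64 = 1024 words
    bins = {f"registers_{i}": register_bin for i in range(register_sets)}
    for name in ("instruction_cache", "data_cache", "shared_data_cache"):
        bins[name] = cache_bin
    return bins
```

With one register set the model requests four Mentor+Bin pairs totaling 3584 words; with separate fixed and floating point sets it requests five pairs totaling 4096 words.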

If the three spaces are kept separate in the main memory of the Host programs, bounds checks may be performed. If the three spaces are mingled in main memory (the typical case), each of the three is defined as the full memory size and the intrusion bounds check mechanism may be muted.

The Mentors of the data cache Variable and the instruction cache Variable operate on Bin data segments, and the Bin segment content is typically updated in the Bins only, to minimize Memory traffic. “Write through” to memory may be delayed as long as practical and is done only if the segment has been altered by the program and/or is reassigned to handle another memory location and/or program operation is terminated.

The shared memory space, through which processors interact, is semaphore space and is used by the Mentors handling shared memory. Regarding this space the Mentors operate in immediate memory write through (semaphore) mode. As soon as a new result in the Bin is no longer under speculative condition, the “memory write through” is performed. If the emulation includes the rarely used vector instructions, an additional Mentor/Bin may be included to emulate the vector register functions.
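The two write-through policies described above (delayed write-through for the data and instruction cache Bins, and immediate write-through for shared space once a result is no longer speculative) may be contrasted with the following Python sketch. The class `BinSegment` and its methods are hypothetical, a dictionary stands in for Main Memory, and the handling of a value that is still speculative at reassignment (it is simply dropped) is a simplifying assumption of the sketch.

```python
class BinSegment:
    """Toy model of one Bin segment with delayed or immediate write-through."""

    def __init__(self, memory, address, immediate=False):
        self.memory = memory        # dict standing in for Main Memory
        self.address = address
        self.immediate = immediate  # True for shared (semaphore) space
        self.value = memory.get(address)
        self.dirty = False
        self.speculative = False

    def write(self, value, speculative=False):
        self.value = value
        self.dirty = True
        self.speculative = speculative
        if self.immediate:
            self.flush()            # no effect while the value is speculative

    def commit(self):
        # The result is no longer under speculative condition.
        self.speculative = False
        if self.immediate:
            self.flush()

    def flush(self):
        # Write through only altered, non-speculative content.
        if self.dirty and not self.speculative:
            self.memory[self.address] = self.value
            self.dirty = False

    def reassign(self, new_address):
        # Reassignment forces the delayed write-through of altered content.
        self.flush()
        self.address = new_address
        self.value = self.memory.get(new_address)
        self.dirty = False          # assumption: unresolved speculation is dropped
```

In the delayed mode Memory sees a value only at reassignment (or an explicit flush at termination), while in the immediate mode Memory is updated as soon as the value is committed.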

Despite the fact that the “Mentor+Bin” sets which are assigned to emulate the registers are capable of writing into memory, the Host emulation does not allow them to write into memory, as the content of “emulated registers” in the Host machine typically does not have a corresponding Main Memory location (the PSDW does not include register content). All register to memory or memory to register operations are therefore done by transferring operands from the register Bin to the data cache Bin or shared memory Bins for the memory write operations. Similarly, memory reads are done through transfers from the data cache Bin(s) and shared memory Bin(s) to the register Bin(s).

Parallel Operation Using Multiple Language Unit (LU) Sets

In various embodiments described herein, the data structure employs 8-word block operations under a single program thread control. Through a somewhat more complex design of the interconnect crossbars, the crossbars may operate either as a single processor hardware thread, as previously discussed, or as 8 processors, i.e., 8 hardware processing threads. Each separate hardware thread may be logically connected to one unit out of the sets of 8 Language Units. The system, in this operation mode, uses 8 Host LUs or 8 DONA LUs. Each “hardware thread” is controlled by a pair consisting of a Language Unit and a Dynamic VLIW program control section.

Implementation of a data structure including the 8 “hardware threads” option may involve somewhat more complex two-layer crossbars for Bins to Functional Units and Functional Units to/from Bins; the two crossbars (alignment crossbars and transfer crossbars) may need to switch positions. The Bins to/from Main Memory design need not change, as cache to/from memory transactions are done in blocks.

Bounds Protection System Wise

Bounds protection per Variable provided by the Mentor circuits may be effective as long as the Mentors covering the Variables are active; however, once a Variable's Mentor is dormant, and specifically the Mentor has not been reserved to serve a “conscious dormant thread”, the mechanism no longer protects the Variable's space.

Embodiments as described herein allow for the hierarchical definition of Variables such that the whole working memory space of a program, containing program and data, is also defined as a Variable, so that the entire program space is protected even when parts may be dormant. In this case some Variables may be covered by more than one Mentor circuit. One Mentor, for example, covers array “A” and another covers all the Variables in a program, including array “A”.

The complete system solution lies in the proper integration of the “fine feature” Variable bounds protection capabilities provided by the Mentors (including, when applicable, the Mentor roster circuits) in the architecture, combined with the “coarse feature” page protection presently afforded through the Virtual page management systems.

The system may synchronize the Mentor protection with page protection operation such that elements that are dormant in the VMF system are also co-located in pages on the hard drive (or other secondary storage and transfer of pages from permanent to dynamic storage). Page swap is a good time to check the credentials of the requestor and verify security before dormant Variables are reconstituted to the active state.

Dynamic VLIW Control Section

Dynamic VLIW Control Word Model

As used herein, the term “Dynamic VLIW” may be used in the same context as “microprogramming” was used in the architecture of the SEL 32/55. A Dynamic VLIW word, either generated by the Instruction Interpreter section (Dynamic) or previously compiled for a specific function, is used in managing all the control activity over the data structure during one cycle. Each Dynamic VLIW word may control activities on behalf of one or more machine language instructions, and the actions pertaining to each machine language instruction are typically found in two or more VLIW words.

In typical deployments of VLIW techniques, the VLIW directly controls all elements of the data structure. In embodiments described herein, the Dynamic VLIW does directly control some elements (like FUs); nevertheless, in some embodiments, most elements, including the Frames and crossbar interconnects, are controlled indirectly. The Dynamic VLIW control may issue commands in terms of logical Channels and logical Bin operands to the Mentor circuits. The Mentor circuits may translate the logical Variable addresses to physical Frame addresses, and with the cooperation of crossbar control logic the Dynamic VLIW command information is translated to crossbar configurations. It is recognized that there are many methods for implementing pipeline control other than the Dynamic VLIW control method described herein. Furthermore, all control design methods may benefit from architecture approaches described elsewhere herein, such as the use of the Mentors, Frames, Bins or crossbar interconnects. A big advantage of the method described herein is its applicability to spare parts replacement strategies, thus enabling large footprint devices at low cost. Most other design techniques tend to place distributed, typically small, complex “one of a kind” control circuits throughout the design.

Dynamic VLIW Instruction Format

The Dynamic VLIW instruction flow, as seen in FIG. 19, may come from different sources: (1) the DONA instruction-decoder/Dynamic VLIW-coder circuit residing in the DONA LU, (2) the Host instruction-decoder/Dynamic VLIW-coder circuits residing in the Host LU, and (3) the VLIW cache. The VLIW cache may be used to house functions (sine, cosine, I/O interrupt, etc.) coded in the VLIW format as well as language interpreters (Java, C++, etc.).

System bring-up may use the “Host” decoder/Dynamic VLIW-coder to transport existing software to the processor described herein. Host programs may, however, be limited by their register architecture concepts. Programs compiled to DONA are able to take advantage of the full range of error and malware protection tools, larger parallel performance scope, conscious thread techniques and debug support tools.

The VLIW instruction words contain two parts, the VLIW Sequence control (FIG. 20) and the data-structure control (FIG. 21). The data-structure control contains literals and control signals for managing the underlying, pipelined data structure.

The VLIW Sequence Control Section

Consider the sequence control basic control format type “000” in FIG. 20; it contains the following fields:

(1) OP-Code: A 3 bit op-code responsible for choosing the next VLIW instruction address.

This VLIW sequence control does not have an instruction counter; each VLIW contains the address of the following VLIW(s). In the VLIW program, the LSB of the address of the next VLIW is “0”. Each VLIW word contains branch test information such that, if the test turns true, the LSB is modified to “1”. Thus, VLIW words accommodate condition tests each cycle, without slowing the VLIW program flow. For a VLIW that uses the address/literal field as a literal, the next VLIW address is the present address +2.

OP Codes

OP code (000) is the “normal” VLIW operation. The address section contains an even address in the VLIW cache. The next VLIW instruction to be fetched is the one at the even address when the Test condition in the VLIW's test field is false (not taken), or the following (odd address) VLIW instruction if the VLIW Test condition is true.
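The next-address rule for OP code (000) amounts to setting the least significant bit of an even address from the Test outcome. A minimal Python sketch follows; the function name and the error handling for an odd address field are assumptions made for illustration, and exception redirection is not modeled.

```python
def next_address_op000(address_field: int, test_true: bool) -> int:
    """Return the next VLIW address for OP code (000).

    The address field holds an even VLIW cache address; a true Test
    condition changes the LSB from 0 to 1, selecting the odd neighbour.
    """
    if address_field & 1:
        # The text states the address section contains an even address.
        raise ValueError("address field must be even for OP code 000")
    return address_field | int(test_true)
```

For example, with an address field of 0x40 the flow continues at 0x40 when the test is false and at 0x41 when it is true, so no extra cycle is spent computing a branch target.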

In addition to the Test condition that modifies the least significant bit of the address, there are 16 VLIW addresses that are chosen by the Test logic hardware in order to handle exceptions. If any of the exceptions occur, the VLIW Sequence control (or decoder/VLIW-coder) stops initiating new operations into the pipeline, and allows the VLIW program to bring outstanding operations already in the pipeline to conclusion. Following completion of outstanding operations, the flow control proceeds to one of 16 exception locations in the VLIW program, a location selected by the event that triggered the exception logic.

The choice of which conditions participate and get selected by the Test field and which conditions may cause any of the 16 exceptions is a design implementation choice, based on the target (DONA instruction, Host instruction, Function, etc.) as well as the architecture of Main memory (a virtual page fault, for example, may cause such an exception).

The handling of the special and exception conditions is typically idiosyncratic and specific to the definition and handling of zero, overflow and underflow, and to the handling of privileged instructions, virtual memory and I/O architectures. Normally, the first exception discovered is the one reported and any following exceptions are ignored, but that is not always the case. The good news is that the number of parameters involved is relatively small, usually just bits including underflow, overflow, condition codes (or DO/ALL loop setup in DONA), page fault and a few PSDW bits. The specific logic mapping may be handled by logic or a relatively small ROM table.

OP code (001). This op code operates the same way as OP code (000), with the exception that the Address/literal field is used to provide a literal operand to the functional unit(s). In OP code (001) the next VLIW address is the present address +2. The reason for this field sharing is that literal operands are typically used in setting up a highly iterative operation, but seldom appear in the inner microcode loops. Thus, this field sharing should have a negligible effect on performance.

OP code (010) is a decode jump (also called scatter jump) where the lower 6 bits of the address come from a data field in the underlying data structure. The VLIW program may jump to one of 64 locations. The jump base location is given by the Address field. The Test field selects the specific source of the six bit data field. Thus, in this case, the Test field selects a scatter jump field and may also select a test condition.
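The scatter jump can be sketched in Python as replacing the lower 6 bits of the base address with the selected data field. The function name is hypothetical, and the sketch assumes the base location is treated as aligned to a 64-entry boundary; the text does not specify how a non-aligned base would be handled.

```python
def scatter_jump_target(base_address: int, data_field: int) -> int:
    """Form the target of an OP code (010) decode (scatter) jump.

    Assumption: the low 6 bits of the base are discarded and replaced by
    the 6 bit data field, giving one of 64 locations in the block.
    """
    return (base_address & ~0x3F) | (data_field & 0x3F)
```

With a base of 0x1C0, a data field value of 5 selects location 0x1C5, and any of the 64 field values stays within the block 0x1C0 to 0x1FF.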

OP codes (011 and 100) are subroutine call and subroutine return, respectively. These are VLIW level calls and returns and may be used, for example, for microprogramming interpreter routines (Java, C++, etc.) and Dynamic VLIW complex Functions.

Op codes (101-111). The rest of the sequencing control op codes are reserved for hardware error isolation, specific to the manufacturing and test process, and for future use.

(2) Address Location: A 1 bit field that specifies whether the branch address (in the Address field) is an address within the VLIW cache or a machine language program address. Some exceptions and I/O interrupt conditions are first handled by the Dynamic VLIW cache, which sorts out the issue, and control is then transferred to a machine language program.

(3) Mapping Bit: The Mapping Bit is used in emulating a current Register based ILP “Host”, where high use sections (innermost loops, etc.) may be converted by special software to VLIW Functions and program flow mapped to VLIW flow.

(4) Test Field: The Test Field is an 8 bit field that selects one of many conditions under which the program should perform a VLIW program branch. A branch may be caused by an overflow, underflow, an addressing fault, or end-of-loop conditions sent from one of the 16 Index counters. As noted, the condition modifies the LSB of the next address from 0 to 1, as each VLIW word contains a branch option.

The Test conditions may simply select a single hardware logic line or may participate in a logically compound circuit. The logic circuit for compound conditions is a specialized circuit fitting the Host or DONA machine instruction set, PSDW (Program Status Double Word), and condition codes. A compound condition would be used where multiple functional units in the pipeline may indicate underflow or overflow. Outcomes of some tests are machine idiosyncratic. For example, the architecture may require that an overflow resulting from a division by zero sets the result to zero. The specific LU logic circuit used in instruction decode/Dynamic VLIW coding has special logic or ROM circuits to implement those specific functions.

Additionally, as stated, the test field may be used to simultaneously test for some conditions and specify the location of a scatter jump field. Therefore the test field logic is implementation dependent, and its support circuitry may be composed of a small ROM table or just combinatorial logic.

(5) Address Field: This field is typically used to determine the address for the next Dynamic VLIW word, whether it is in the VLIW cache, in buffer memory or in Main Memory. The field's size is 32 bits or larger, as dictated by the size of the DONA or “Host” program address within a virtual memory page. In case of a jump whose address is in the original machine language format, the content of the field is added to the program base address or concatenated to the program page address in order to form a virtual memory address. The choice to add or concatenate is dictated by the virtual memory system architecture. For OP code (001), this field is used as a source of a literal to be sent to a Bin or functional unit as an operand.
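The add-versus-concatenate choice for forming a virtual memory address from the Address field can be illustrated with a short Python sketch. The function name, the `mode` parameter and the 12 bit page offset width are assumptions chosen for the example; the actual widths are dictated by the virtual memory system architecture.

```python
def form_virtual_address(address_field, program_base=0, page_address=0,
                         mode="add", offset_bits=12):
    """Form a virtual address from the Address field (illustrative only)."""
    if mode == "add":
        # Field content is added to the program base address.
        return program_base + address_field
    if mode == "concatenate":
        # Field content is an in-page offset appended to the page address.
        if address_field >> offset_bits:
            raise ValueError("address field does not fit in the page offset")
        return (page_address << offset_bits) | address_field
    raise ValueError("mode must be 'add' or 'concatenate'")
```

Concatenation avoids an adder in the address path but requires the field to fit within the page offset, whereas addition allows an arbitrary base at the cost of a carry chain.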

(6) Outstanding Pipeline Length: This 4 bit number specifies the largest length of pipelined operation activated by this VLIW word. The field size is increased if the design includes functional units whose pipeline control cycle length is larger than 16. This number is used by the VLIW Sequence control for knowing when a speculative condition has been verified. In a misprediction case the flow control typically directs the VLIW program to continue until it concludes all outstanding actions already in the pipeline, but without issuing new operations.

(7) Iteration Counter Control. This 6 bit field, together with additional logic, is used for manipulating the current (innermost) iteration counter in order to repeat the execution of full loop sequences in Dynamic VLIW code. This feature is used for extending the instruction interpreter's scope from “words” to “sentences”, i.e. from “comprehending” one instruction at a time to “comprehending” a full inner loop or a program section. For efficient operations, programs, specifically the inner loop routines, are interpreted as a program section or “program sentence” rather than looking only at the flow of individual instructions. A simple energy and time saving advantage is that once the innermost loop has been efficiently interpreted (that is, the Conditional Branches' preferred directions are accounted for and the loop is fully Dynamic VLIW coded) the instruction interpretation process may be turned off until loop completion.

Stated differently, the LU may prepare an optimized Dynamic VLIW sequence for an inner loop and enter it once into the VLIW buffer. The LU therefore may do a complete inner loop coding before the onset of the inner loop for inner loops that do not contain branches, or it may do so for inner loops containing branches after a few iterations or after performing a “dry run” that determines the (prediction) bias of the conditional branches inside the loop, as the expected speculative branch direction significantly influences the command field selection in the VLIWs of the VLIW coded loop.

Once the inner loop is loaded into the VLIW buffer (optionally including several VLIW words past loop completion for “priming” the following VLIW sequence), the LU may cease to decode the program until the loop is terminated.

The DONA machine language uses LOOP END instead of a Conditional Branch instruction to terminate enumerated loops (loops controlled by an Index, i.e. DO or FOR and ALL loops using Indices to terminate loop operation). The DONA LU may distinguish immutable from mutable program sections (see Simple Relaxation sections), making the DONA LU coding of a VLIW inner loop program a new and attractive feature of the DONA program operation. (In present ILP machines loop iteration typically uses speculation based Conditional Branch instructions, a process which interferes with installing bounds checks.)

Loop iteration counter codes

“000000” No iteration counter activity.

“01AAAA” This Dynamic VLIW sequence should be repeated by the count given in Index AAAA. This feature is typically used in priming sequences for functional units with a long pipeline where the same VLIW line needs to be repeated several times. (For example, ALL i=0, K; A(i)*B(i)+C(i); END ALL; using a 5 stage MPY FU requires 5 MPY priming cycles prior to using the bypass from the MPY to the ADD FU for entering the full operation using both ADD and MPY.)

“100000” End of Dynamic VLIW loop iterations controlled by this Index.

“1XXXXX” Other functions that will be defined as part of the implementation.
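The loop iteration counter codes listed above can be decoded as in the following Python sketch. The function name and the returned labels are hypothetical; codes of the form “00XXXX” other than “000000” are not defined in the text and are reported here as unspecified.

```python
def decode_iteration_control(code: int):
    """Decode the 6 bit Iteration Counter Control field (illustrative)."""
    if not 0 <= code < 64:
        raise ValueError("iteration control is a 6 bit field")
    if code == 0b000000:
        return ("none", None)                 # no iteration counter activity
    if code >> 4 == 0b01:
        return ("repeat", code & 0xF)         # repeat by count in Index AAAA
    if code == 0b100000:
        return ("end_loop", None)             # end of Dynamic VLIW loop
    if code >> 5 == 1:
        return ("implementation_defined", code & 0x1F)
    return ("unspecified", code)              # 00XXXX codes not given in the text
```

For instance, the code “010011” decodes to a repeat controlled by Index 3.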

The buffered Dynamic VLIW may include one inner loop or a set of multiple loop levels. There is a tradeoff which includes Dynamic VLIW buffer size, the size of the loop's code, the length of setup time and LU sophistication versus the higher compute per cycle efficiency achieved when the Dynamic VLIW is configured for maximum compute performed per iteration cycle. The tradeoff in each particular implementation may be made by the LU based on the Index value, i.e. the number of expected loop iterations.

(8) Control Type. This 3 bit field determines the data structure type that is associated with the VLIW, as explained next in the Data Structure Control Section. This 3 bit field is used to differentiate between (a) normal data structure control activity and (b) activity that sets up the mentor circuits and loads them with a particular Variable's parameters, including memory location, type of variable, operand size, and array dimensions. This activity also manages the virtual Bins. When the number of Variables in the program exceeds 48, physical Mentor circuits need to be reassigned. The content and format of the data structure control word section (FIG. 21) will differ based on the control activity type.
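The eight sequence control fields enumerated above can be packed into a single word as in the Python sketch below. The field widths are taken from the text (with the Address field at its 32 bit minimum and the Mapping Bit as one bit); the field order, the field names and the resulting 58 bit total are assumptions for illustration, since the actual layout is given by FIG. 20.

```python
# (name, width in bits); order is an assumed layout, most significant first.
FIELDS = [
    ("op_code", 3), ("address_location", 1), ("mapping_bit", 1),
    ("test", 8), ("address", 32), ("pipeline_length", 4),
    ("iteration_control", 6), ("control_type", 3),
]

def pack_sequence_control(values: dict) -> int:
    """Pack field values into one integer word; missing fields default to 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack_sequence_control(word: int) -> dict:
    """Recover the field values from a packed word."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values
```

Packing followed by unpacking returns the original field values, which is a convenient check when experimenting with alternative layouts.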

Data Structure Control Section

Control activity “000”. The standard control activity and its associated data format are shown in FIG. 21. This activity applies to the transfer of operands to the functional units from the Bins and bypaths, the setup of tarmac registers, the instructions to the FUs, and the setup of paths for results returned to the Bins. All data transfer activities in the data structure are done in terms of 8 words or less, as indicated by the controls. A maximum of four sets of 8 FUs may be activated by a single command word.

Per FIG. 21, the amount of data structure activity per cycle is comprised of (1) zero to four Bins to FUs transfers, (2) zero to four Bypaths to FUs transfers, (3) zero to four FU commands, and (4) zero to two FUs to Bins transfers.

(1) The format for Bins to FUs transfers is a 24 bit control field whose details were given in FIG. 16.

(2) The format for Bypaths to FUs transfers is a 10 bit field where four bits specify the source bypath or the input tarmac (at this leg) and six bits specify the FU that is to receive the transfer. Note that bypaths connect within a group of FUs, where the groups are, for example, floating point, fixed point and byte stream operations.

(3) The format for FU commands is a 5 bit field. MSB=“0” is a standard FU command (ADD, SUB in an adder; AND, OR, EXOR in an FU, etc.). For a single input FU (shift, for example), the second leg, receiving an operand through a literal or tarmac register, provides the amount of shift.

(4) The form of FUs to Bins transfers is the same 24 bit form as the Bins to FUs transfers.

Control activity type 001. Same format as 000 while using field sharing. Only a single “Bins to FUs” channel is specified and the rest of the command fields contain a literal operand. Note that the results return channels are available, as they will typically contain “return flight traffic” of on-going machine language instructions where the operands were sent to the FUs by previously issued VLIW(s).
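The 10 bit Bypaths to FUs field and the per-cycle activity limits given for control activity “000” can be sketched in Python as follows. The function names are hypothetical, and placing the four source bits above the six FU bits is an assumed bit order not stated in the text.

```python
def encode_bypath_transfer(source: int, fu: int) -> int:
    """Encode a Bypaths to FUs transfer: 4 bit source, 6 bit receiving FU."""
    if not 0 <= source < 16:
        raise ValueError("source bypath/tarmac must fit in 4 bits")
    if not 0 <= fu < 64:
        raise ValueError("receiving FU must fit in 6 bits")
    return (source << 6) | fu  # assumed order: source in the upper bits

def activity_within_limits(bins_to_fus, bypaths_to_fus, fu_commands, fus_to_bins):
    """Check one cycle's activity against the stated maxima (4, 4, 4, 2)."""
    return (0 <= bins_to_fus <= 4 and 0 <= bypaths_to_fus <= 4
            and 0 <= fu_commands <= 4 and 0 <= fus_to_bins <= 2)
```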

Control Activity “010”. This control activity transfers Indexing Formulas to Mentors.

Unlike present processors, where the machine language code has only one destination through which the Instructions Interpreter manipulates the data structure, the DONA code has three different types of DONA code segments.

The first type of code segment is (1) the DONA algorithmic code, which uses Variable names instead of memory and register addresses. The code, when converted to VLIW of format 000 and 001 above, will directly manipulate the data structure. The VLIW code is typically executed a few cycles after being encoded by the LU.

The second and third types of codes are codes destined to (2) configure the Mentors and (3) supply Indexing formulas to the Mentors. The Mentors are the destination of two types of code segments containing Variable definition (see FIG. 17 and FIG. 18) and Indexing Formulas. The Indexing Formulas code segments are described in the following sections.

There are two methods to consider for handling the transfer of the Variable definition and Indexing Formulas to the mentors. The first method is to place the information in the VLIW in a format dedicated for this transfer task and, a few cycles later, transfer the information from the VLIW to the mentors. The second method is to leave the information in the DONA machine language code file and, a few cycles later, arrange a transfer from the DONA code file to the appropriate Mentors. It may be a design choice in particular implementations which method to use.

For strictly DONA instruction interpretation the second method looks preferable, as it is simpler and more compact; one need not provide space in the VLIW for ferrying the Variable definition and Indexing Formulas.

However, a key route for incorporating new capabilities is to enable the use of the new concepts first as VLIW coded functions. The role and use of those Functions can be field tested by the user community before their official incorporation in machine language and especially in compatibility to the HLLs, be they C++, Java, FORTRAN or other languages. In order to take this route, a better fit is to allow placement of the Indexing Formulas as part of the VLIW code, so the discussion below explains this particular VLIW format.

In Control activity “010”, the content of the control section, with the exception of the “return flights, results” (FU=>Bin transfers) controls, contains the indexing formulas sent to the Mentors. The Indexing Formula format is given in the parts describing DONA.

Control activity type 011: Routine Variable Initiation. This control line typically appears at the onset of a routine and points to a list of new Variables as well as constants that need to be assigned to Mentor circuits in order for the routine to become operational. The VLIW control word contains the Bin which includes the Initiation table and the location of the table inside the Bin. The Bin may be the Program Bin or another Bin specifically assigned by the compiler or OS for this function. The table contains the memory base addresses and sizes of all new variables to be initiated by this routine. The Dynamic VLIW program flow control circuit (FIG. 19, FIG. 13) will read the table information and request assignment of Mentors to activate the Variables. Once the Mentors are allocated, the table information plus the Mentor and Bin resources assigned to them may later be converted to Virtual Mentor Files (VMF) to be loaded into the VMF stack if the Variable is turned dormant. Following the Variables' activation, the operation of the routine commences.

When LRU methods are used for resource allocation, Mentors may be turned dormant. The LRU algorithm should retain all recent routine table activation addresses such that the reactivation of a routine is done in a few cycles, as both the previously used Mentors and the VMF are already on file.
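One way to picture LRU-based Mentor reassignment is the Python sketch below, in which a Variable evicted from a physical Mentor is saved as a dormant entry and restored on reactivation. The class `MentorPool`, the dictionary standing in for the VMF stack, and the use of `OrderedDict` are illustrative assumptions; the default of 48 physical Mentors follows the figure mentioned for Control Type.

```python
from collections import OrderedDict

class MentorPool:
    """Hypothetical LRU allocator for a fixed set of physical Mentors."""

    def __init__(self, physical_mentors=48):
        self.capacity = physical_mentors
        self.active = OrderedDict()   # Variable -> definition, least recent first
        self.vmf_stack = {}           # dormant Variables (stand-in for VMFs)

    def activate(self, name, definition=None):
        if name in self.active:
            self.active.move_to_end(name)      # mark as most recently used
            return self.active[name]
        if name in self.vmf_stack:
            definition = self.vmf_stack.pop(name)   # reconstitute from VMF
        if len(self.active) >= self.capacity:
            victim, victim_def = self.active.popitem(last=False)
            self.vmf_stack[victim] = victim_def     # turn the LRU Variable dormant
        self.active[name] = definition
        return definition
```

Because the dormant definition is retained, reactivating a recently used Variable does not require rereading the Initiation table, which mirrors the few-cycle reactivation goal stated above.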

Control Type (1XX). The additional formats are reserved for future uses, which include the formation of high speed I/O communications setups for processor teams.

VLIW Instruction Issue Control Circuit Design

FIG. 22 is a block diagram example of the combined operation of a Language Unit (LU) and the Dynamic VLIW Flow control and instruction issue circuit.

VLIW Control Circuit Logical Flow

Starting from the top of FIG. 22, machine language program flow comes from the Program Bin. In conjunction with information from the program status double word (PSDW), the aligning logic circuit breaks the flow into individual machine language instructions (Host machine language) or Ops and Variables list in DONA.

Individual instructions, plus branch prediction information for branch instruction(s), are placed into four queues in an equivalent form to a four instructions per cycle issue ILP. The priority among the queues is roving, such that the queue with the longest time in the queue has the highest priority. As an additional consideration, instructions with a long pipeline (outstanding completion) should preferably not be placed in the same queue as instructions whose out of order issue they may interfere with.

Each cycle, instructions in the queues verify that the operand information in the Bin (DONA) or in the Bin assigned to the Register the instruction uses (Host) is present or will be available at VLIW issue time. If there is no conflict in operand availability and no other conflict imposed by a higher priority queue, the interpreted instruction is placed in the outgoing VLIW and the outstanding operation of the instruction (store the results) is placed in the outstanding (return) operation section in a VLIW N cycles later. N is the pipeline delay associated with the FU performing the operation initiated by the just issued VLIW. The issued Dynamic VLIW control words are placed in a VLIW circular file such that once an inner loop (or other iterative program section) is placed in the Dynamic VLIW circular file, the control may cease instruction interpretation, as the iterative section is fully placed in the Dynamic VLIW buffer.
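The placement of an instruction's issue part in the outgoing VLIW and of its return part in the VLIW N cycles later can be sketched with a small circular file in Python. The class name, the slot structure, and the restriction that the FU delay be shorter than the file are assumptions made for the example.

```python
class VliwCircularFile:
    """Hypothetical circular file of Dynamic VLIW control words."""

    def __init__(self, size):
        self.size = size
        self.slots = [{"issue": [], "returns": []} for _ in range(size)]

    def issue(self, cycle, instruction, fu_delay):
        """Place the issue part at `cycle` and the return part N cycles later."""
        if not 0 < fu_delay < self.size:
            raise ValueError("FU pipeline delay must be shorter than the file")
        self.slots[cycle % self.size]["issue"].append(instruction)
        # N is the pipeline delay of the FU performing the operation.
        self.slots[(cycle + fu_delay) % self.size]["returns"].append(instruction)
```

With an 8 entry file, an operation issued at cycle 6 on an FU with a 5 cycle delay has its result return scheduled in slot 3, illustrating the wrap-around.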

The instruction decode of the queue may be implemented by a fast memory, typically ROM, whose inputs are the OP code section of the instruction plus some PSDW and other information (branch prediction; honor/ignore underflow/overflow, etc.).

The Dynamic VLIW instruction stream may come from the instruction interpretation logic (FIG. 16) or from a Dynamic VLIW cache preloaded from a software compiled VLIW program Function, where the Function is compiled directly into VLIW as a method of defining new functions, speeding up particular inner loops, or developing and testing VLIW encoding methods prior to implementation of those methods in next generation versions of the computing device design.

In some embodiments, a design implements VLIW techniques in the context of Mentors and of dual control levels. Specifically, the Dynamic VLIW does not directly control the data structure but is used to direct the Mentors.

In some embodiments, repetitive loops are converted dynamically from machine language to VLIW format, and the (optimized) Dynamic VLIW code is then used to run the iterative section of the program.

The method described in FIG. 22 fits mostly for the “Host LU” and existing four-instructions-per-cycle interpretation technology. We expect that the DONA LU may start this way but the LU section will quickly evolve to a different form as powerful linguistic forms and associated terms are introduced to the instruction set. Following sections further discuss “plural form” in lieu of “N times singular”. Consider the US Constitution, starting with “We the people . . . ”; the prevailing present “N times singular” approach requires “We the people” to be translated to a list of people by name, state, place of residence, or other means.

Present machine languages might in a sense be thought of as in the “baby talk” era: no plurals, no pronouns. Translate JFK's speech “Ask not what your country can do for you . . . ” to baby-talk and you come up with gobbledygook like “Jimmy should not ask what Sara, and Becky, and Billy and . . . ”.

This is not advocating matching machine language to human language, which is a different field concerned with screen display, “who owns the screen”, voice recognition, and other technologies as well as being engaged in battles of wills between ways customers want tasks done and ways the (software, service) suppliers want tasks done.

While computers have been named many times “thinking machines”, they are nothing of the kind. Watch a cat recognize strange movements and you will surely recognize a period where the tail is twitching and the cat is “thinking” prior to “acting” on this information. Computers are nothing of the kind; they do not “think”, they only “act” according to the prescribed algorithms. Once someone defines the relations between “thinking” and technical devices that perform “thinking”, we will engage in a different field of engineering.

There are many critical aspects of human communication, especially in the social and political areas, that rely on vagueness, exaggerations, lies, incoherency and even acting obtuse.

Meanwhile, as we are implementing “acting machines”, there are hosts of linguistic concepts that should be added to the computer machine language, enabling computers to better do their current and their new “acting” jobs. Vagueness, lies, exaggerations, incoherency and acting obtuse certainly do not belong in the new machine language dictionary. However, concepts like plural, pronoun, class, inheritance, polymorphism, etc. do belong there. For an interesting example, the architecture defined herein might need to deal with a US 14th Amendment type of issue (in security), as “Infinite Variables” are not “native born” in the processor's memory.

The “language Wall” might be a bigger problem than the “memory Wall”, “power Wall”, and “ILP Wall”. In this architecture, the Language Unit (LU) is specifically separated from the Dynamic VLIW Flow control in order to be able to upgrade the LU with minimal disturbance to the rest of the hardware structures.

As stated, the instruction interpretation flow depicted in FIG. 22 fits the present (Host) instruction interpretation methodology and probably will do the initial job for DONA. However, significantly better performance is expected from LU designs that take advantage of plural and other language features to interpret entire code sections, for example by optimizing loop iterations (into VLIW code) once instead of repeatedly reinterpreting the same instruction set in each loop cycle.

DONA Instruction Set Architecture

Two examples are described herein and demonstrate the need for including the concepts of “plural” in machine language. The examples are the use of ALL instead of DO when the loop index is strictly used for enumeration and the use of IMMUTABLE (see Simple Relaxation Algorithm) to define revision control.

The architecture defined herein takes advantage of the native micro-parallelism of many algorithms. Presently this native micro-parallelism is blocked through two related mechanisms. The first is the serialization typically imposed by compiling into register architecture code, where all array elements participating in the loop's operation must pass through a single machine Register. The second is the lack of “plural form” in computer HLLs. In analogy to English, HLLs use the “N-times-singular” form to deal with plural subjects. The plural form of “Company! About-face!” is stated as: “DO I=1, 120; Soldier (I) About-face; END DO;”. The intent of whether the soldiers should turn about simultaneously or turn about one after the other (wave) is obscured.

A linguistically powerful machine language may add to performance and robustness, and may call for sophisticated instruction interpretation methods. The machine language improvements that are included in this disclosure (ALL, IMMUTABLE, MARK-SEGUE, UNDO) are mainly handled by the compiler, as are the additional DO and ALL opcodes that replace the loop end Conditional Branch Op Code and are used for loop termination based on index value. (MARK-SEGUE sets an undo point for debug support.) This approach enables micro parallelism and avoids the addressing over-reach of the branch prediction mechanism that interferes with read bounds checks. The hardware address bounds check may be provided by the Mentors.

A hardware design of instruction interpretation that uses present techniques for interpreting 4 instructions per cycle is sufficient for the early versions of the hardware section of instruction interpretation, since most limitations in ILP machines are not due to parallel interpretation but to operand interlocks caused by the “Registers” namespace. Specifically, in DO I=1,N; A(I)<=B(I)+C(I); END DO; all members of arrays A, B, and C are paraded through three individual Registers assigned to A, B and C. Some versions may use more sophisticated techniques that “comprehend” complete loops, small routines or routine code segments.

The mapping domain consideration. In some embodiments, the system uses the Variable namespace mapping. Variables may be given a logical ID in a 256 Variable namespace in a thread (this may be done totally independently of whether parallel multithreaded operation is used by the processor). At the next mapping level the mapping accommodates 256 threads in a “logical processor” namespace. Mapping of the namespaces to physical and logical memory pages may be done through a set of namespace tables. The numbers of variables per thread and number of threads were chosen for reasons of code compactness. Many namespace mapping methods may be used to implement different DONA formats, including using existing memory system architecture or direct logical or physical memory addresses as Variable ID, despite the significantly larger code they require and other problems they pose.
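The two-level namespace (256 Variables per thread, 256 threads per logical processor) can be pictured with the Python sketch below, in which a thread ID and a Variable ID form a 16 bit key into a namespace table. The class, the flat key encoding, and the (base, size) table entries are illustrative assumptions rather than the disclosed table format.

```python
VARIABLES_PER_THREAD = 256
THREADS_PER_PROCESSOR = 256

def namespace_key(thread_id: int, variable_id: int) -> int:
    """Combine thread and Variable IDs into one 16 bit key (assumed encoding)."""
    if not 0 <= thread_id < THREADS_PER_PROCESSOR:
        raise ValueError("thread ID outside the logical processor namespace")
    if not 0 <= variable_id < VARIABLES_PER_THREAD:
        raise ValueError("Variable ID outside the thread namespace")
    return thread_id * VARIABLES_PER_THREAD + variable_id

class NamespaceTable:
    """Hypothetical table mapping logical Variable IDs to memory extents."""

    def __init__(self):
        self.entries = {}

    def bind(self, thread_id, variable_id, base, size):
        self.entries[namespace_key(thread_id, variable_id)] = (base, size)

    def resolve(self, thread_id, variable_id, index):
        base, size = self.entries[namespace_key(thread_id, variable_id)]
        if not 0 <= index < size:
            raise IndexError("index outside the Variable's bounds")
        return base + index
```

An 8 bit Variable ID in the code stream is the source of the compactness mentioned above; the table lookup supplies the full address only when needed.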

The DONA Machine Language Formats

DONA code contains code sections in three different formats: two formats are used to set up the Mentors and the third has the same function as regular machine language code. A Variable definition example was given in FIG. 17 and FIG. 18; the indexing formulas are discussed below, followed by the DONA algorithmic code proper. Having different forms for the DONA algorithmic code and the indexing formulas is a design option taken here for producing compact DONA code.

DONA Indexing Formulas

FIG. 23 illustrates an example of a DONA indexing formula. Operand indices are based on addressing formulas using the values of constants, Variables and loop indices. Examples of use are formulas such as (i+1,j), (i*2), C(i−2,j+2,k). The addressing formulas and their IDs are typically sent to the Mentor prior to loop onset.

Consider the following example:

A1(i,j)<=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25

The first step is converting this expression into DONA indexing format for the four indexing formulas corresponding to:

(i,j−1), (i−1,j), (i+1,j) and (i,j+1)

During operation, formulas are assigned to a Mentor by sending, to the Mentor responsible for the Variable “A”, the following expression, which defines formulas 7, 8, 9 and 10:

0,1−=>7; 1−,0=>8; 1+,0=>9; 0,1+=>10;

The indexing formula is coded in a modified Reverse Polish notation statement including the operators “+”, “−”, “*”, “=>” and “,”.

The DONA algorithmic statement for the expression, in a different form of modified Reverse Polish notation, is changed from:

A1(i,j)<=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25

To:

A1<=A₇ A₈+A₉+A₁₀+0.25*

The algorithmic statement and the indexing formulas both use Reverse Polish notation, but with different formats and character sizes. The Indexing formulas use a 4 bit character set in the format described below. The algorithmic statements use 10 bit characters and are defined later. The main reasons for the difference are character size and the fact that the Indexing formulas are context dependent, being relative to the current values of the loop indices. This type of implementation is not a requirement of the architecture, just an implementation approach.

The DONA format for Indexing formulas consists of four bit characters. Unlike the norm in programs, the indexing formulas are context dependent in that they are relative to the current loop nesting level.
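To illustrate how a Reverse Polish statement of this kind can be evaluated, here is a small stack-based sketch. It is not the DONA interpreter: the token spelling (A7 for the operand of Variable A under formula 7, and so on), the function name and the sample operand values are assumptions made for illustration.

```python
import operator

# Binary operators used in the example statement.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_rpn(tokens, operands):
    """Evaluate a Reverse Polish token list with a simple operand stack."""
    stack = []
    for token in tokens:
        if token in OPERATORS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        elif token in operands:
            stack.append(operands[token])      # named operand, e.g. A7
        else:
            stack.append(float(token))         # numeric literal, e.g. 0.25
    if len(stack) != 1:
        raise ValueError("malformed Reverse Polish expression")
    return stack[0]


# Hypothetical values of the four neighbouring cells.
neighbours = {"A7": 4.0, "A8": 8.0, "A9": 12.0, "A10": 16.0}
result = eval_rpn("A7 A8 + A9 + A10 + 0.25 *".split(), neighbours)
```

With these sample values the result is (4 + 8 + 12 + 16) * 0.25 = 10.0, which would then be stored to A1(i,j).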

If a Variable operand uses the nominal indexing case, for example A(i,j,k), no indexing formula is needed, as the Variable's operand address is determined according to the indices of the nested loop and the Variable's parameters (operand size, dimensions and base address). Index level “0” denotes the innermost loop index, index level “1” the next enclosing loop index, etc.

In the case of an expression like B(i,j)<=A(i,j)−A(j,i); the Variable operands for B(i,j) and A(i,j) are simply addressed by the Variable names “B” and “A”. A(j,i), however, is not the nominal form and does need an Indexing formula. In the Indexing formula for A(j,i), the Index characters 0101 and 0110 are used to indicate, respectively, an index level one higher and one lower than that of the nominal index level case.

The operand addressing is based on the nominal case (i, j, k) indices, where “i” is the index of the innermost loop (index level “0”), “j” is the index of the next (enclosing) loop (index level “1”), etc. One formula is possible for each Variable dimension.
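The following sketch shows one way an operand address could be derived from the loop indices, per-dimension formula offsets and the Variable's parameters (operand size, dimensions, base address), together with a bounds check. The storage order (first index varies fastest), the 1-based indices and all names are assumptions for illustration only.

```python
def operand_address(base, operand_size, dims, indices, offsets=None):
    """Return the byte address of an operand, with a bounds check.

    dims, indices and offsets hold one entry per Variable dimension,
    innermost loop index first. Indices are 1-based.
    """
    if offsets is None:
        offsets = (0,) * len(dims)     # nominal case: indices unmodified
    linear, stride = 0, 1
    for dim, index, offset in zip(dims, indices, offsets):
        effective = index + offset
        if not 1 <= effective <= dim:
            raise IndexError("operand index outside Variable bounds")
        linear += (effective - 1) * stride
        stride *= dim
    return base + linear * operand_size


# A 14 x 14 Variable of 8 byte operands at a hypothetical base address.
nominal = operand_address(0x1000, 8, (14, 14), (7, 7))            # A(i,j)
shifted = operand_address(0x1000, 8, (14, 14), (7, 7), (0, -1))   # A(i,j-1)
```

In the nominal case no offsets are supplied, mirroring the statement above that no indexing formula is needed when the indices are used unmodified.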

Consider the Simple Relaxation equation;

A1(i,j)<=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25

Note that “i” refers to the current innermost loop counter value and “j” to the next (outer) loop value. Absence of entries implies that the index values are used unmodified.

The Indexing formulas for (i,j−1), (i−1,j), (i+1,j), and (i,j+1) are sent to the Mentor of Variable A (Variable ID 17) as Mentor Formulas 6, 7, 8 and 9. Note that A1(i,j) does not need an Indexing formula as it is the nominal case. In a shorthand notation the formulas are posed as:

17 ‘6’ , −1 ; ‘7’ −1 , ; ‘8’ +1 , ; ‘9’ , +1 ;

The DONA format for indexing formulas is: Variable ID (8 bits) followed by a set of Formula IDs (8 bits), each followed by the formula proper in 4 bit characters (nibbles).

Continuing the above example, the Mentor of Variable 17 is sent four formulas.

-   Formula 6. i, j−1.
-   Formula 7. i−1, j.
-   Formula 8. i+1, j.
-   Formula 9. i, j+1.

In binary form:

-   00010001 17
-   00000110 0000 0010 1001 1111 ‘6’ , −1 ;
-   00000111 0010 1001 0000 1111 ‘7’ −1 , ;
-   00001000 0001 1001 0000 1111 ‘8’ +1 , ;
-   00001001 0000 0001 1001 1111 ‘9’ , +1 ;

As we have stated, this is an example of coding Index formulas and one can substitute many other code formatting variations.
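The encoding of the example can be sketched as follows. The nibble values used here (0000 for “,”, 0001 for “+”, 0010 for “−”, 1001 for the literal 1, 1111 for “;”) are read off the nibble patterns in the example and are assumptions rather than a full definition of the 4 bit character set; the function names are hypothetical.

```python
# Assumed nibble codes, inferred from the example only.
NIBBLE = {",": 0b0000, "+": 0b0001, "-": 0b0010, "1": 0b1001, ";": 0b1111}


def encode_formula(formula_id, symbols):
    """Render an 8 bit Formula ID followed by its 4 bit characters."""
    nibbles = " ".join(f"{NIBBLE[s]:04b}" for s in symbols)
    return f"{formula_id:08b} {nibbles}"


def encode_block(variable_id, formulas):
    """Render the 8 bit Variable ID, then one line per formula."""
    lines = [f"{variable_id:08b}"]
    for formula_id, symbols in formulas:
        lines.append(encode_formula(formula_id, symbols))
    return lines


block = encode_block(17, [
    (6, [",", "-", "1", ";"]),   # (i, j-1)
    (7, ["-", "1", ",", ";"]),   # (i-1, j)
    (8, ["+", "1", ",", ";"]),   # (i+1, j)
    (9, [",", "+", "1", ";"]),   # (i, j+1)
])
```

Each formula costs one ID byte plus two bytes of nibbles, which illustrates why the 4 bit character set keeps the Mentor setup code compact.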

DONA Main Algorithm Code

FIG. 24 illustrates an example of a DONA main algorithm code. The DONA algorithmic code will be more familiar to compiler writers than to machine designers, as it is basically a reverse Polish notation of the algorithmic code using 10 bit characters. Each 10 bit code character contains a 2 bit definition and an 8 bit entry. For compatibility with word/byte memory architecture, the code is organized in sections of 5 or more bytes. The first byte of a section contains four 2 bit definitions for the following four code entries.

“00” Variable ID: the code entry contains the logical ID of one of 256 Variables in a program thread and uses the nominal indexing.

“01” Variable ID followed by the ID of the Indexing formula for the Variable.

“10” OP code or Delimiter: the code entry contains either an op code like ADD, SUB, MPY, SHIFT, FADD, FMY, etc. or a code delimiter. The code delimiters include DO (FOR), IF, THEN, ELSE, END, “;”, CALL and RETURN, as well as IMMUTABLE and ALL. Note that the approach is not to reduce standard HLL concepts to simpler (or more simplistic) machine language elements, since doing so discards relevant plural form information and information useful for security and robustness. We believe that making the processor smarter is well within technology reach, and that maximizing the transfer of relevant information to the hardware is the right way to proceed for better software security and robustness. An OP code entry may be followed by a label or labels denoting targets of branches (IF) or an address label of a parameter list (CALL). If the OP code requires a label or parameter(s), the OP code defines the size of the following label or parameter(s).

Character code “11” and a “0” in the MSB of the literal byte: a 7 bit literal whose value is within +/−63.

Character code “11” and a “1” in the MSB of the literal byte: a Long literal. The 7 bit entry denotes the number of bytes (N), and the actual literal is in the following set of N bytes.
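A minimal sketch of the basic 5 byte section layout follows: one header byte carrying four 2 bit definitions, then four 8 bit entries. The bit order within the header (first entry in the most significant bits), the opcode value used for ADD and the function names are assumptions; entries that extend a section beyond 5 bytes (formula IDs, labels, long literals) are not modelled.

```python
TAG_VARIABLE = 0b00           # Variable ID, nominal indexing
TAG_VARIABLE_FORMULA = 0b01   # Variable ID followed by an Indexing formula ID
TAG_OP_OR_DELIMITER = 0b10    # OP code or delimiter
TAG_LITERAL = 0b11            # literal

HYPOTHETICAL_OP_ADD = 0x01    # illustrative opcode value, not from the text


def pack_section(entries):
    """Pack exactly four (tag, value) pairs into a 5 byte section."""
    if len(entries) != 4:
        raise ValueError("a basic section holds exactly four entries")
    header = 0
    body = bytearray()
    for tag, value in entries:
        if not 0 <= tag <= 0b11 or not 0 <= value <= 0xFF:
            raise ValueError("tag or entry out of range")
        header = (header << 2) | tag   # first entry ends up in the top bits
        body.append(value)
    return bytes([header]) + bytes(body)


def unpack_section(section):
    """Recover the four (tag, value) pairs from a 5 byte section."""
    header = section[0]
    tags = [(header >> shift) & 0b11 for shift in (6, 4, 2, 0)]
    return list(zip(tags, section[1:5]))


section = pack_section([
    (TAG_VARIABLE, 17),
    (TAG_VARIABLE, 18),
    (TAG_OP_OR_DELIMITER, HYPOTHETICAL_OP_ADD),
    (TAG_LITERAL, 5),             # short literal: MSB of the byte is 0
])
```

Grouping the four definitions in one byte keeps every entry byte aligned, which is the compatibility with word/byte memory mentioned above.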

The use of Variable IDs, and thus mapping, instead of memory addresses may make DONA code significantly more compact than register machine language code.

It is not so important whether one classifies a language expression as an instruction (OP) or a delimiter (CALL, DO, FOR, END, “;”, “<=”, etc.), where the delimiters mainly imply context information rather than specific actions. What is important is how to transfer the relevant information to a processor, and specifically how to design smarter processors that use the context information to overcome the “intellectual bottleneck . . . [of] . . . word-at-a-time thinking”.

A partial list of new machine language elements: DO (or FOR), IF, BEGIN, END, ALL, IMMUTABLE, MARK-SEGUE, UNDO. MARK-SEGUE and UNDO enable the programmer to apply the hardware undo mechanism used for branch prediction recovery to other programmable “undo” purposes.

As a practical matter, the full list of new machine language elements should be chosen in conjunction with the following choices: (1) The “mapped ID approach and mapping strategy” or the “name=memory-address” approach. (2) The base for the software migration route. Our present candidates are interpreter languages (Java, C++, Visual C++, etc.), each forming a well-organized and tested set of “conceptual units” for the starting base. (3) The bit compatible versus functional route for Host emulation. (4) The key new elements of this computing device that should augment the DONA migration set, like the VMF as well as the plurality support operators needed to demonstrate the basic performance and operational characteristics of the new computing elements.

A software migration route from a C++ or similar interpreter does not require that every instruction (primitive) be implemented by a corresponding hardware (machine language) primitive. Low use and complex instructions or data types may be implemented by microprogram and/or software routines that are based on simpler primitives.

From a product development point of view, the operative use of C++ or another intermediate language interpreter that deploys the Object Oriented discipline is very significant. The task of software migration may be done in parallel with the hardware development (instead of serially).

This includes modifying the interpreter base to include new key features (VMF, plural micro-parallel expressions, etc.), modifying key features (making “Object” match “Variable”, etc.), as well as using the new base for providing the necessary test suite, OS and compilers.

The method also allows the hardware design to commence even if some of the more esoteric but not performance critical Op Codes are defined later in the project, as their implementing codes are destined to stay either in software or microcode. An example may be the issue of resolving polymorphisms of strings and new security features in OS table setup for a basic mapped Variable ID schema chosen for the instruction set.

Special attention should be paid to hardware and microcode features that maximize the local autonomy of the computing device and minimize trips up and down through the operating system governing hierarchy, including interactions through shared memory spaces. An example of such a feature is the Mentor roster circuit presented herein.

For insight into the overwhelming size of this issue consider the DARPA 2008 report, page 63:

-   -   As a historical context: the Cray XMP, the world's fastest supercomputer in 1983-1985, was a “balanced machine” . . . . This explains why the XMP would spend about 50% of its time moving data on a memory intensive code, but BG/L may spend 99% of its time moving data when executing the same code.

Thus, comparing the IBM BG/L to the Cray XMP on an equal footing of work per cycle, the IBM POWER processors used in the BG/L “work team” are only 2% as efficient as a single processor (the Cray XMP), since 1% of the time is left for the work itself versus 50% on the XMP. To restate the report, in order to keep 2 units effectively deployed doing the “end work”, 98 additional units are deployed in administrative paperwork and data shuffle. This inefficiency is not a result of vector versus scalar instructions, as both the Cray XMP and the IBM POWER chip used in the IBM BG/L have vector capabilities.

Variable Names Based on Base Address and Other Mapping Options

As noted earlier, we chose to base the presentation on a specific choice of system parameters, including a 256 entry namespace for Variables, 16 Indices, 48 Mentors, 4K Frames, etc. It must be emphasized, however, that this selection of methods and numbers is made only to provide examples; an explanation without specific numbers becomes vague and hard to follow.

The Variables namespace and number of Indices may be made larger or smaller depending on architecture implementation choices, for example a 4K Variable logical ID namespace size (instead of 256). One of the Variable logical ID choices is to use (as is normally done at present) the Variable's base address in memory as the Variable's ID, for example a 16 bit ID in an addressing structure that uses 64K memory blocks, where a full 32 bit memory address contains a 16 bit LSB part and a 16 bit block address (MSB). The use of memory addresses as the Variable's label makes for a less compact program but may have some programming advantages, as it is the same as in current architectures, where the memory location (single Variable) or the Variable's base location (dimensioned Variable) is also the Variable's label.

Additional Functional Units and Configurations

The Content Addressable Memory (CAM) Functional Unit

Embodiments described herein may provide a simple structure to integrate new functional units. The following is a description of a Functional Unit of a new type.

In various embodiments, a content-addressable memory (“CAM”) functional unit may include some aspects not typically associated with a classical arithmetic/logic Functional Unit.

-   1. The CAM unit, like other memory structures, is built out of a large number of identical structures; thus the spare parts strategy is applicable within a single unit, and a single copy is sufficient to retain the yield and self-repair options.
-   2. While typical arithmetic/logical Functional Units are 64 bits wide, the CAM FU may be 256 bits wide.

FIG. 25 illustrates one embodiment of a CAM functional unit. The CAM Functional Unit (CAM-FU) depicted in FIG. 25 is configured as an array of 32×128 elements, where each element consists of a search pattern key (in bytes), comparators, and degree-of-match elements. The result of the comparison between the byte in the key and the scanned pattern (match/no-match) is sent to the key pattern degree-of-match element.

The 128 key positions are loaded with search keys (up to 32 characters each) and degree-of-match levels. Degree-of-match “0” means that the key is not participating in the search. The degree-of-match setting is the number of character matches required for the circuit to raise the match flag. A setting of 30/32 indicates that at least 30 out of 32 bytes must match the characters in the key. Similarly, a setting of 12/12 indicates a full match to a 12 character key.

After the CAM-FU is set up with a set of search keys, the searched data byte stream is paraded in front of the CAM-FU, searching for matches to the 32 character keys. An advantage of this design is that the degree-of-match per key can be adjusted, so that in early stages of the search not only exact matches but also close matches can be detected and the search keys can be adjusted iteratively. For working with DNA codes the mechanism should allow 6 bit character (DNA character) shifts and 2 bit shifts (one base pair shift).
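The degree-of-match behaviour can be modelled in software as below. This is a sequential sketch of what the CAM-FU would do in parallel hardware; the function name, the result tuple layout and the sample keys are assumptions, and bit-level shifts for DNA codes are not modelled.

```python
def cam_search(stream, keys):
    """Scan a byte stream against keys with per-key match thresholds.

    keys is a list of (pattern, required_matches) pairs; a required
    count of 0 means the key does not participate in the search.
    Returns (position, key_index, match_count) for every raised flag.
    """
    hits = []
    for position in range(len(stream)):
        for key_index, (pattern, required) in enumerate(keys):
            if required == 0:
                continue                         # key disabled
            window = stream[position:position + len(pattern)]
            if len(window) < len(pattern):
                continue                         # not enough bytes left
            count = sum(a == b for a, b in zip(window, pattern))
            if count >= required:
                hits.append((position, key_index, count))
    return hits


stream = b"GATTACAGATTTCA"
keys = [
    (b"GATTACA", 7),   # exact match required (7/7)
    (b"GATTACA", 6),   # close match allowed (6/7)
    (b"CAT", 0),       # disabled key
]
hits = cam_search(stream, keys)
```

Lowering the threshold of the second key from 7 to 6 is what lets it report the near match at position 7, which is the iterative key adjustment described above.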

The size of the 32×128 sensitivity adjusted CAM is just our first guess; experience should show what the most practical CAM dimensions are.

In the iterative adjustment of the keys, the system attempts to solve two problems: (1) finding the location of pattern(s) in the stream and (2) finding how the pattern(s) are expressed.

In the CAM-FU example, a hardware FU circuit may perform in a single cycle an operation that might take several thousand cycles when using standard arithmetic/logic primitives.

I/O

The following are examples of two different types of I/O interfaces that may be used in various embodiments.

First type: processor-to-memory I/O interface, where I/O devices interact with memory under the supervision of the BIOS program in the processor, similar to present machines.

Second type: Mentor based “infinite Variable” I/O interface channels. The high speed Mentor based I/O interface, once initialized, acts independently of the current algorithm during data transfers. The Mentors may supervise the high speed I/O channels during I/O transfers; the I/O FUs are responsible for operating the appropriate transfer protocol(s). Each I/O FU type is, for example, designed to conform to a Gigabit era high speed serial protocol or an instrumentation protocol and, depending on the protocol, may include physical provisions for interconnecting to fiberoptic line(s). The specific protocols currently included are Gigabit Ethernet, SONET and InfiniBand.

The proper inclusion of the “infinite Variable” types allows processor to processor links via InfiniBand or Internet links, and the crossbar switches, to be part of the basic programming tools in a multiprocessor environment. This relieves the system of some of the “Memory Wall” problems created by the fact that processors presently may only cooperate through shared memory, since memory is the only “world” programmers may see through their current processor's namespace.

With the Mentor based interfaces, information accessed through communications, point-to-point (InfiniBand, etc.) and instrumentation interfaces is part of the Variable space visible to the programmer, and the high speed interfaces allow direct Bin-to-Bin and Bin-to-I/O transfers. Many system approaches are far better served, in both computing and energy efficiency, by direct, high speed, Bin-to-Bin information transfers that bypass Main Memory altogether, and by having communications and instrumentation, as well as main memory, be part of the processor's recognized operative space (for example, a Variable somewhere in the “cloud” is part of your processor's namespace).

There are presently several technologies for serially encoded, high speed I/O interfaces: close range InfiniBand, and long range fiber optic WDM (Wave Division Multiplexing), DWDM (Dense Wave Division Multiplexing), SONET and Ethernet. InfiniBand is a common lower cost interface, but may be limited to short physical distances for chip-to-chip interactions.

The high speed interconnect operations are done by logically connecting the I/O channels through the interconnect matrix to the appropriate Bins. The connection between a Bin and a high speed I/O channel through the crossbars stays in place for the duration of the channel data transfer. This operational form differs from arithmetic and logic Functional Unit operations, for which the Dynamic VLIW control may rearrange crossbar connectivity each cycle.

During high speed interconnect operation the appropriate Bin(s) are reserved for exclusive interconnect use for the duration of the operation. The high speed interfaces allow data chaining and command chaining.

Clock Synchronization

In some embodiments, processor groups may deploy the de-journal clock synchronization used in SONET telecommunications. Clocks are not synchronized to a master clock on each clock tick, which is a difficult task in a computer room and impossible across multiple, physically separated systems. Instead, the clock circuits in each physical subsystem (cards or racks) may use the telecommunication SONET standards and the telecommunication industry clock design approach, with the same or different de-journal parameters, where the processor's clock frequency, based on synchronization points, has to be adjusted to meet the tolerance of a small credit/debit range in clock ticks per day. This approach ensures that the clock circuits agree in the number of clocks emitted over given time durations (millisecond, second, minute, hour, etc.), the parameters being chosen by the integrated system design.

This technique provides a guarantee that, in workgroups, no processor leads or lags the rest of the group in data transfer operations, as this might require an unlimited amount of buffer storage. Instead, the de-journal clock synchronization bounds the maximum buffer size needed for cooperative processor work. This basic SONET technique is typically followed in operating fiberoptic communications among physically separate system modules across the globe.

Alternate Embodiment #1

An alternate embodiment (referred to herein as Alternate Embodiment #1) is similar to the embodiments described above, with the exception of a different functional emphasis: the majority of arithmetic and logic units are replaced by high speed I/O Functional Units, for example 16 Functional Units each having one high speed serial input and one high speed serial output. The I/O Functional Units operate InfiniBand, DWDM fiberoptic Ethernet or DWDM fiberoptic SONET interfaces, depending on the model implementation choices.

In Alternate Embodiment #1, units serve as interconnect hubs, traffic controls and flow monitors in production groups, interaction buffers, and organizers of large memory banks. In this embodiment, the processor may have a separate program memory for security reasons. The program memory may be inaccessible to electronic data transfers throughout the system. This feature allows the processor to operate with the assurance that no electronic means may compromise the programs.

Alternate Embodiment #1 may be viewed as a communication switch or a router. However, an advantage of the architecture definition is the inclusion of external data streams as infinite Variables, such that both data processing and communications can use the same programming paradigms throughout systems and programs. Variables can reside in a single processor or be literally distributed among many processors in physically separate systems and still be viewed and programmed in the same Variable namespace of a unified system.

Alternate Embodiment #2

In some computing applications, the parallel nature of the computation is most evident in the innermost loops. This type of application may also be referred to as micro-parallelism. Other problems may, however, display a massively parallel nature only at a gross level (macro parallelism). Examples are commercial transaction processing such as airline reservations, customer transactions in the banking industry, online retail transactions, transactions of “cloud computing” storage services, etc. Those transactions involve massively parallel computing systems interacting with millions of users.

The macro parallel nature of this type of massively parallel computing is typically present only at the full transaction level; each individual customer has to be handled individually. This type of computing rarely engages a significant amount of floating point processing, or even significant local, parallel, fixed point processing. The individual transactions perform a small amount of computation and mostly entail context switching among a plethora of routines; the parallel nature is in the millions of users engaged in the transactions. In the case of “data mining”, the work typically involves a sequential search for fit to pattern(s) per individual record; the parallelism in the examination of billions of individual information records in search of patterns is that the work is done in parallel by multiple processors.

In macro-parallelism or most personal use environments there may be no significant performance advantage in having multiple copies of functional units (though gaming and graphics are quickly changing this picture). In another alternate embodiment (referred to herein as Alternate Embodiment #2), a data structure includes only a single copy of each functional unit. This embodiment may be an implementation of the general architecture described previously, in this case using its own version of the instruction set or using an instruction set compatible with the versions described above containing multiple FUs.

As with the embodiments described above, the computing device may, in this embodiment, use a three level control structure: the LU level (multiple instruction interpretation per cycle), the dynamic VLIW level (issuing actions on behalf of multiple instructions in terms of Variables operation) and the Mentor level (mapping Variable IDs to physical operand addresses for main Memory and for the Bins). The data structure may contain a Frames/Bins structure to accommodate Variable LHSM functions including the Variable's cache operations, speculative undo, etc.

In this embodiment, rather than using, for example, four main Memory buses, there may be only a single bus. The single bus significantly reduces the number of I/O interconnects, such that the computing device may be implemented using the silicon process and packaging techniques common to present ILP processors rather than requiring the more demanding design considerations associated with PIM or other technologies. The internal spare parts strategy may or may not apply, as the main elements left having potential multiple copies are the Frames/Bins structure and Mentor circuits.

The change from 8 sets of FUs to a single set of FUs may indicate a change from the spare parts strategy to a strategy befitting an ensemble of unique parts. In this case the Mentor section may be a set of physical Mentors, or a structure that architecturally may be envisioned as a set of N copies of the Mentor circuit but in which the “N-Mentors-circuit” is designed as a single unit having memory banks that hold individual Mentor information, interconnected to a set of shared indexing arithmetic/logic elements, where the number of indexing arithmetic/logic elements is dictated by the indexing and bounds check work flow.

The size and number of Frames in this alternate embodiment may be similarly optimized to provide a proper fit to the work flow. Multithreading and conscious thread strategy may be emphasized to enable the OS and application program environments each to have an active response context switch for active threads.

The ALL HLL Statement

In some embodiments, a high level language implements an ALL statement (or another word with equivalent effect as further described herein). In this implementation, the order of doing the operations is irrelevant to the intent of the program. The machine should perform the operations in the order that will maximize performance.

In existing high level languages (HLLs), there may be no distinction between the use of DO (or its equivalent FOR) for true sequential dependency, as exists in a Fibonacci sequence computation (see FIBONACCI below), and the use of DO in algorithms where the DO statement and the DO index serve as a linguistic means of enumeration, where enumeration is strictly a way of identifying the number of operands, as in the case of the following ADD-ARRAYS example.

In the suggested improvement the programmer uses ALL instead of DO to indicate enumeration for the sake of operand naming only. Also, when the compiler recognizes that the DO indexing is used strictly for operand ID, it automatically replaces the DO (or FOR) phrases with ALL phrases. ADD-ARRAYS then appears as:

BEGIN ADD-ARRAYS;

ALL I=1, P;

D(I)=A(I)+B(I)+C(I);

END ALL;

END ADD-ARRAYS;

If however a program contains the statement:

BEGIN FIBONACCI;

A(1)=K; A(2)=L;

ALL I=3, N;

A(I)=A(I−1)+A (I−2);

END ALL; END FIBONACCI;

The ALL phrase in FIBONACCI is flagged as an error due to operand dependencies, and the error message specifies the offending entries as A(I−1) and A(I−2).
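A compiler check of this kind can be sketched as follows, under the simplifying assumption that every operand is indexed by the loop index plus a constant offset. The function names and the representation of reads as (name, offset) pairs are hypothetical; a real dependency analysis would have to handle general index expressions.

```python
def format_operand(name, offset):
    """Render an operand such as A(I-1) for an error message."""
    if offset == 0:
        return f"{name}(I)"
    sign = "+" if offset > 0 else "-"
    return f"{name}(I{sign}{abs(offset)})"


def all_loop_conflicts(target, target_offset, reads):
    """List reads that make the loop body order-dependent.

    A read conflicts when it names the array being written but at a
    different index offset, so one "iteration" consumes a value that
    another "iteration" produces.
    """
    return [
        format_operand(name, offset)
        for name, offset in reads
        if name == target and offset != target_offset
    ]


# D(I)=A(I)+B(I)+C(I): no conflicts, ALL is accepted.
add_arrays = all_loop_conflicts("D", 0, [("A", 0), ("B", 0), ("C", 0)])

# A(I)=A(I-1)+A(I-2): both reads are reported as offending entries.
fibonacci = all_loop_conflicts("A", 0, [("A", -1), ("A", -2)])
```

An empty result means the enumeration is used for operand naming only; a non-empty result is the list of offending entries for the error message.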

Adding the ALL phrase, especially in complex programs, relates to performance and to quickly resolving issues where the programmer expects high performance due to the parallel nature of the code and receives poor performance results.

Using ALL, the programmer can clearly distinguish between the case where the compiler did not recognize the parallel nature of the code and the case where the compiler recognizes the parallel nature of the code but either the object code or the hardware is doing a poor job of taking advantage of it. So, while one does not normally classify slower than expected performance as an error unless real time systems are involved, bringing coherency to the picture by using expanded semantics is fundamental to robustness as well as a tool for getting performance on track.

A linguistic look at the problem shows that most present HLLs and machine languages do not have a “plural form” in the language and use “N times singular” where a plural form should be used. The ALL statement is by definition a plural form, indicating that the operations may all be performed in parallel.

In the example above, augmenting HLL semantics with the ALL word may enhance robustness and, along the way, performance. Incorporating the corresponding parallel activity in the system described above may enable a computing device that includes the corresponding machine language ALL Op Code to take full advantage of parallelism in the code. During run time the hardware may not have sufficient time to analyze the code, especially in a complex program, and discern that there are no operand dependencies barring parallel operation among the “iterations” in the loop, so that all the “iterations” may be performed in parallel or in any combination of serial and parallel. Specifically, the so called “iterations” in the case of ALL are not “iterations” but a linguistic method for including the “plural” form in the HLL and in the machine language (like a “cold shoulder” having little to do with either).

Simple Relaxation Algorithm

This example was chosen because it is a simple demonstration of issues relating to the lack of a plural form (and provides solutions). Such issues exist in complex modeling algorithms, where the algorithms are typically much more complicated and the problems show up in code porting and are hard to isolate and demonstrate.

The following example demonstrates a problem one may encounter when moving code from sequential to parallel operation: the results differ, causing a compatibility problem, and tracing the problem shows that the source of the incompatibility is sometimes that “the new results are too accurate”. The problem is important because it is part of the case for including “plural properties” in computer language code, both in HLL and in machine language.

In a simple relaxation algorithm the value of a cell in a matrix is computed by averaging the four neighboring cells. In this example, assume we are using the simple relaxation algorithm to model the behavior of the surface of a water pond after we dump a cup of water in its middle. For the example we chose a square pond and use a 14×14 cell matrix “A” to represent the pond's water surface. All surface value points are initially set to “0” with the exception of the center A(7,7), where a disturbance of 10,000 is introduced.

The mathematical equation representing the simple relaxation algorithm is:

A(i,j)=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25

A simple way to convert the simple relaxation algorithm to a program for an N×N (14×14) matrix “A” is:

T=2; DO j=1, N; DO i=1,N;

A(i,j)=0;

END DO; END DO;

A(7,7)=10000; M=N−1;

DO k=1,T; DO j=2, M; DO i=2,M;

A(i,j)=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25;

END DO; END DO; END DO;

Simple Relaxation Algorithm A

There are two artifacts created by the fact that the computer algorithm uses substitution instead of solving the equation presented by the mathematical model.

Specifically, the algorithm computes one cell at a time, while in the physical system all the cells act simultaneously. This causes two types of problems. The first problem is that the sequential execution causes a “smear effect”, which introduces an error into the computed results. The second problem is performance degradation due to operand dependency caused by the sequential relation introduced not by the original physical model but by the algorithm.

The error artifact is due to the fact that the values of A(i,j−1) and A(i−1,j) are the results of the present iteration, while the values of A(i+1,j) and A(i,j+1) are the results of the previous iteration. Some may recognize the issue as a “version control” mix up.

A correct model, true to the modeled physical phenomena, should base all iteration K values on the results obtained in iteration K−1. Doing so introduces no computation bias or smear errors, as all new values are independent of the sequence and time in which they were obtained.

When we make the algorithm immutable to the iteration (stay true to version control), as shown in the following Relaxation Algorithm B, the results are different, as they have neither computation bias nor smear artifacts.
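The smear effect can be observed with a direct transcription of Simple Relaxation Algorithm A. The sketch below uses Python lists with an unused row and column 0 so that indices match the 1-based text; the function and variable names are illustrative.

```python
N = 14  # the matrix is N x N, indexed 1..N as in the text


def relax_algorithm_a(sweeps):
    """In-place relaxation: each cell update sees earlier updates."""
    # Row and column 0 are unused padding to keep 1-based indices.
    a = [[0.0] * (N + 1) for _ in range(N + 1)]
    a[7][7] = 10000.0
    for _ in range(sweeps):
        for j in range(2, N):          # j = 2..M with M = N-1
            for i in range(2, N):      # i = 2..M
                a[i][j] = (a[i][j - 1] + a[i - 1][j]
                           + a[i + 1][j] + a[i][j + 1]) * 0.25
    return a


grid = relax_algorithm_a(1)
before_neighbour = grid[6][6]   # computed before A(7,6) was updated
after_neighbour = grid[8][6]    # computed after A(7,6) was updated
```

After a single sweep, A(6,6) is still 0 while its mirror cell A(8,6) is already 625, because A(8,6) consumed the value of A(7,6) produced earlier in the same sweep. The disturbance is smeared in the direction of the loop order.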

DO j=1, N; DO i=1,N; A(i,j)=0;

END DO; END DO;

A(7,7)=10000;

M=N−1; DO k=1,T; DO j=2, M; DO i=2,M;

A1(i,j)=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25;

END DO; END DO;

DO j=2, M; DO i=2,M;

A(i,j)=(A1(i,j−1)+A1(i−1,j)+A1(i+1,j)+A1(i,j+1))*0.25;

END DO; END DO; END DO;

Simple Relaxation Algorithm B

A basic change in the algorithm, specifically the inclusion of iteration version discipline, produces an accurate representation of the physical phenomena, and the results are devoid of the “smear” effects. All computations are performed based on the results of the correct previous iteration(s). Relaxation Algorithm B has introduced a computational “iteration version control” into the algorithm. Also, to avoid extra computations and simultaneously achieve version control, the inner DO loops are duplicated. This is done to avoid copying the array to a temporary array and then storing the results in the “A” array.

The results for the present iteration are computed based on data entries from the previous iteration. The results obtained after one iteration, which corresponds to two iterations of Algorithm A, are symmetric and represent the ripple effects one would expect. The effective ripple spread per iteration is the same in both Algorithms A and B.
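The Ping-Pong arrangement can be sketched in Python as follows. This is a minimal illustration that assumes one outer iteration consists of a sweep from A into A1 followed by a sweep from A1 back into A. The helper names `sweep` and `relax_ping_pong` and the 0-based indexing are assumptions made for the example.

```python
def sweep(src, dst):
    """One version-controlled sweep: dst is computed only from src."""
    n = len(src)
    for j in range(1, n - 1):
        for i in range(1, n - 1):
            dst[i][j] = (src[i][j-1] + src[i-1][j] + src[i+1][j] + src[i][j+1]) * 0.25


def relax_ping_pong(n=14, t=1, source=10000.0):
    """Alternate between two arrays so no sweep reads its own output."""
    a = [[0.0] * n for _ in range(n)]
    a1 = [[0.0] * n for _ in range(n)]
    a[6][6] = source  # A(7,7) in the 1-based notation of the listing
    for _ in range(t):
        sweep(a, a1)  # A1 is computed from A
        sweep(a1, a)  # A is computed from A1
    return a


result = relax_ping_pong()
```

With `t=1` the cells at equal distance from the source hold equal values, for example `result[6][4]` and `result[6][8]` are both 625.0, and the ripple has spread two cells in every direction, matching two sweeps of Algorithm A in reach but without the directional bias.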

Another outcome is that Algorithm B produces the same results independent of the amount of parallel action deployed. For example, four processors running four quadrants of the matrix will produce different numeric results as compared to a run by a single processor when using Algorithm A. The four processors will produce numeric results identical to those of a single processor when running Algorithm B. Thus Algorithm B is portable to parallel systems.
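The order independence can be illustrated with a short Python sketch. As a simplifying assumption, a reversed traversal order stands in for a different division of work among processors. All function names here are hypothetical.

```python
def stencil(g, i, j):
    """Average of the four neighbors of cell (i, j)."""
    return (g[i][j-1] + g[i-1][j] + g[i+1][j] + g[i][j+1]) * 0.25


def make_grid(n=14, source=10000.0):
    g = [[0.0] * n for _ in range(n)]
    g[6][6] = source  # A(7,7) in 1-based notation
    return g


def sweep_in_place(grid, order):
    """Algorithm A style: every write is visible to later reads in the same sweep."""
    a = [row[:] for row in grid]
    for i, j in order:
        a[i][j] = stencil(a, i, j)
    return a


def sweep_versioned(grid, order):
    """Algorithm B style: all reads come from the previous version of the grid."""
    new = [row[:] for row in grid]
    for i, j in order:
        new[i][j] = stencil(grid, i, j)
    return new


n = 14
forward = [(i, j) for j in range(1, n - 1) for i in range(1, n - 1)]
backward = list(reversed(forward))
grid = make_grid(n)

same_versioned = sweep_versioned(grid, forward) == sweep_versioned(grid, backward)
same_in_place = sweep_in_place(grid, forward) == sweep_in_place(grid, backward)
```

Here `same_versioned` is `True` and `same_in_place` is `False`: the version-controlled sweep does not depend on the visiting order, whereas the in-place sweep does.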

Computation Speed Up

In addition to portability and accuracy, one gets a significant speed improvement in processors capable of deploying any form of parallel computation, and specifically micro parallelism. A superscalar execution of Algorithm A will typically take about 16 cycles per A(i,j) computation due to operand dependency on the previous operand, specifically A(i−1,j). The three operations affected by operand dependencies are the two ADD operations, the first using A(i,j−1) and the second using the result of the first ADD, and finally the MPY (*0.25) operation.

The same ILP superscalar may perform a single A(i,j) computation of Algorithm B in about 8 cycles if the IMMUTABLE (version control) discipline is enforced. This is about half of the time, since the IMMUTABLE computation arrangement guarantees that there are no operand dependencies caused by the algorithm.

Both operands A(i,j−1) and A(i−1,j) are responsible for introducing smear errors into Algorithm A, as they both bring values from the wrong iteration level. However, only A(i−1,j) is responsible for computational delays due to operand dependency, as A(i,j−1) was computed many cycles earlier.

The compiler for an ILP processor may generate additional operand dependencies in the code if an insufficient number of work registers is assigned.

Performance improvement may be further gained by micro parallelism once version control that automatically abides by computation immutability is instituted into the algorithm by the iteration version control statement in the HLL and machine languages. Please also note that while computation immutability may be achieved by declaring the arrays immutable, thus requiring two copies of the original array, other techniques requiring less memory space may also work. Therefore one should note that immutable arrays and version controlled IMMUTABLE code sections are related, but they are not the same thing.

In the context of the embodiments described herein, introducing version control methods into the source code language is important, since it brings correct results to algorithms that previously suffered from “smear effects” and enables micro and macro parallel operations in code that is presently classified as serial even though the algorithm may model an “embarrassingly parallel” process.

Method of Introducing Version Control into Present HLL Programs

The method of introducing operand version control described here is through the addition of an iteration revision control context statement to the HLLs. The advantage of a context statement is that it does not force or limit the compiler to a single type of solution, such as the use of immutable arrays. The statement only requires that either the compiler or the user provide a solution that will pass muster in processing the algorithm according to iteration version control discipline.

The following approach introduces a key word, be it VERSION CONTROL, IMMUTABLE or another key word, into a program closure in the form of:

-   -   BEGIN IMMUTABLE
    -   END IMMUTABLE

For example, see the enclosure of Algorithm A below:

-   -   BEGIN IMMUTABLE;
    -   DO k=1,T; DO j=2, M; DO i=2,M;
    -   A(i,j)=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25;
    -   END DO; END DO; END DO;
    -   END IMMUTABLE;

The method of context enclosure requests that the compiler verify the immutability of the algorithm. If immutability is not observed, the compiler should make the necessary corrections, including interactive suggestions on how to modify the algorithm and other necessary changes. Therefore, it is the responsibility of the compiler to interact with the programmer, to suggest methods to achieve iteration level operand control, and to certify to the hardware that the code section indeed has neither results distortion (smear) nor operand dependency in array computation.

The techniques deployed in Algorithm B are quite different from immutable array techniques deployed, for example, in the Haskell compiler. Where immutable arrays basically need to be “read only” for the duration of the algorithm, the revision controlled or immutable algorithm need only guarantee that version control is not violated. One can achieve version control through immutable arrays, Ping-Pong array computing as illustrated in Relaxation Algorithm B, or several other methods.

The “Context Statement”, Increasing Lexicon to Accurately Convey Information

Better HLL and machine language lexicons (better in that they pertain to “conceptual units” and to programming the properties of conceptual units, using an expanded language that includes information relevant to the “plural form”) may simplify hardware as well as software design. In particular, the lexicon-rich, context dependent approach may enable robust systems in terms of error prevention/detection in the program debug cycle and in malware detection and resistance. In this Simple Relaxation section, a context dependent approach also serves as a key to better design and migration of software, independent of hardware considerations.

For example, the construct:

-   -   BEGIN IMMUTABLE
    -   DO j=2, M; DO i=2,M;
    -   A(i,j)=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25;
    -   END DO; END DO;
    -   END IMMUTABLE

may be preferable to the following three “algorithmic” constructs:

Construct A

-   -   DO j=2, M; DO i=2,M;
    -   A1(i,j)=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25;
    -   END DO; END DO;
    -   DO k=1,T; DO j=2, M; DO i=2,M;
    -   A(i,j)=A1(i,j);
    -   END DO; END DO; END DO;
    -   COMMENT Forced iteration control by use of A1 as temporary;

Construct B

-   -   DO j=2, M; DO i=2,M;
    -   A1(i,j)=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25;
    -   END DO; END DO;
    -   DO k=1,T; DO j=2, M; DO i=2,M;
    -   A(i,j)=(A1(i,j−1)+A1(i−1,j)+A1(i+1,j)+A1(i,j+1))*0.25;
    -   END DO; END DO; END DO;
    -   COMMENT Alternate use of A and A1 for more efficient execution;

Construct C

-   -   DO j=2, M; DO i=2,M;
    -   C=B(i);
    -   B(i)=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25;
    -   A(i,j−1)=C;
    -   END DO; END DO;
    -   COMMENT: A one dimension array the size of one column of A is used as a temporary storage. This minimizes the extra temporary storage in case of very large arrays, however the use of the transfer variable C generates operand dependencies that slow computation;

While constructs A, B, and C each represent a “literal algorithm”, each has disadvantages. The form using the IMMUTABLE context lexicon extension presents the algorithm in unambiguous terms, both in stating the true intent and in producing the correct results.
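The single-column temporary of Construct C can be sketched in Python as follows. This is an illustrative variant, not a transcription: it assumes that the write-back is skipped for the first interior column and that the buffer is flushed into the last interior column after the loops, two boundary details the listing does not spell out. The names `sweep_column_buffer` and `sweep_full_copy` are hypothetical.

```python
def sweep_full_copy(a):
    """Reference sweep that uses a full temporary copy of the array."""
    n = len(a)
    new = [row[:] for row in a]
    for j in range(1, n - 1):
        for i in range(1, n - 1):
            new[i][j] = (a[i][j-1] + a[i-1][j] + a[i+1][j] + a[i][j+1]) * 0.25
    return new


def sweep_column_buffer(a):
    """Version-controlled sweep in place, using one column of scratch storage."""
    n = len(a)
    buf = [0.0] * n  # new values of the previously processed column
    for j in range(1, n - 1):
        for i in range(1, n - 1):
            new = (a[i][j-1] + a[i-1][j] + a[i+1][j] + a[i][j+1]) * 0.25
            if j > 1:
                # Retire column j-1 only after its old value has been read.
                a[i][j-1] = buf[i]
            buf[i] = new
    for i in range(1, n - 1):
        a[i][n-2] = buf[i]  # flush the last interior column


grid = [[0.0] * 14 for _ in range(14)]
grid[6][6] = 10000.0
reference = [row[:] for row in grid]
for _ in range(2):
    sweep_column_buffer(grid)
    reference = sweep_full_copy(reference)
```

After the two sweeps `grid` equals `reference`, so the smaller temporary preserves iteration version control. The cost is the serial chain through the buffer, which is the operand dependency noted in the COMMENT of Construct C.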

Full Parallel Operation of Simple Relaxation Algorithm

FIG. 26 is a flow diagram of array parallel processing for

A1(i,j)=(A(i,j−1)+A(i−1,j)+A(i+1,j)+A(i,j+1))*0.25;

using two outputs one input arrays, achieving the performance of one A1(i,j) computation per single cycle.

The combination of immutable algorithms and micro parallelism produces accurate and true results with a performance improvement factor of about 16 over an ILP processor for the simple relaxation algorithm. This type of micro parallelism may be used by the pipeline and parallel (8+1) FU data structure.

DONA Instruction Long Range View

In some embodiments, DONA (Direct Operand Namespace Architecture) is employed to create a machine level language that meets the goals of the “Variables” model. Three areas in the “Variables” model may be addressed by DONA:

-   -   The data domain
    -   The program scope domain
    -   The mapping domain

In the data domain area, DONA provides a machine language that is cognizant of the concept of “Variables”. In this context, variables are defined as data entities and mapped to the physical address spaces they occupy in memory and in the Bins. While the DONA definition should be independent of the specific implementation of processor hardware, such as the number of Mentors or the size of Bins, the Mentor/Bin concept is intrinsic to the DONA language, as it provides the supporting mechanism in hardware for the concept of “Variables”.

The program scope domain asserts that in the Variables environment “instruction interpretation” is extended to the control section “comprehending” program segments such as an entire inner loop or a small routine. This has both performance and energy efficiency advantages as the machine, for example, need not reinterpret instructions on each iteration.

The mapping domain asserts that the use of “Variables” deploys namespace mapping techniques both to keep program size small and to use the name-tables in support of bounds checks and other software robustness capabilities.
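The idea of a name-table that supports bounds checks can be sketched in Python. This is only a conceptual illustration: the class `NameTable`, its methods, and the (base, length) record layout are hypothetical and do not represent the DONA encoding or the Mentor hardware.

```python
class NameTable:
    """Hypothetical name table mapping short variable IDs to (base, length) extents."""

    def __init__(self, memory_words):
        self.memory = [0] * memory_words
        self.extents = {}   # variable ID -> (base address, length in words)
        self.next_free = 0

    def define(self, var_id, length):
        """Allocate `length` words for a new variable and record its extent."""
        if var_id in self.extents:
            raise KeyError(f"variable {var_id} already defined")
        if self.next_free + length > len(self.memory):
            raise MemoryError("not enough memory for variable")
        self.extents[var_id] = (self.next_free, length)
        self.next_free += length

    def _address(self, var_id, index):
        """Translate (variable ID, index) to a physical address, with a bounds check."""
        base, length = self.extents[var_id]
        if not 0 <= index < length:
            raise IndexError(f"index {index} outside variable {var_id} (length {length})")
        return base + index

    def read(self, var_id, index):
        return self.memory[self._address(var_id, index)]

    def write(self, var_id, index, value):
        self.memory[self._address(var_id, index)] = value


table = NameTable(8)
table.define(0, 4)      # variable 0 occupies words 0..3
table.define(1, 2)      # variable 1 occupies words 4..5
table.write(1, 1, 42)   # lands in physical word 5
```

Because the program refers to a variable by a short ID plus an index rather than by a full memory address, the instruction stream stays compact, and an access such as `table.write(0, 4, 7)` is rejected with an `IndexError` instead of silently overwriting variable 1.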

FIG. 27 illustrates an example of work flow in synchronized hardware and software development based on the use of C++ as the software migration base for the DONA machine instruction set as well as for the very definition of the DONA instruction set. The computing device hardware development contains many complex tasks, including selection of silicon process, clock cycle, pipeline strategy, CAD choices, etc. The following discussion highlights only the DONA definition and software migration.

At 200, the “mapped” Variable name option is selected. In this example, the mapped Variable name option is chosen over “Name=Memory−Address”. The mapped option has advantages in code compactness, robustness, security and moving away from centralized to local processor governance in the OS. The “Name=Memory−Address” method is also in conflict with the implementation of “Infinite Variables” and the inclusion of network(s) and instrumentation in the basic processor model (those Variables are not in Memory).

At 202, the Interpreter base (C++, Visual C++, Java, etc.) is selected. For purposes of the descriptions of the work flow in FIG. 27, C++ is assumed.

At 204, the functional and performance test suites (LINPACK-HPL, STREAM, PTRANS, etc.) are selected.

At 206, a “Host” system is selected. The two basic criteria are (1) that the host includes a C++ interpreter and that key applications and all test suites run on it, and (2) that a major portion of the OS is coded in C++.

At 208, the responsibilities of the Mentor and the responsibilities of the Dynamic VLIW (or main control) are defined. In addition, all official datatypes, including polymorphisms (issues based, context dependent data definition), may be defined. “Version 1” of DONA, comprising the machine-language formats and Op Codes for code proper, Mentor setup, addressing formulas, and VMF formats, is also defined.

At 210, “computing element” hardware design work is continued/commenced using “Version N” information.

At 212, “Machine language level” assembler implementation is commenced/continued by modifying the C++ interpreter into a compiler type program where the “production” portion generates “Version N” DONA code. Portions of the C++ functions in the interpreter that are not performance critical may stay as a part of the assembler and thus not move to the hardware and/or Dynamic VLIW microcode of the computing device. The intent is to provide the best DONA definition, which is not the same objective as making the “computing element” a C++ processor.

At 214, simulator design is commenced/continued. Simulator design may be done by modifying the C++ interpreter to accept “Version N” code.

At 216, the interface language and other parameters of the test suite part implemented in C++ are updated as needed (e.g., per the current DONA definition).

At 218, the C++ coded portion of the benchmark test suite is run on Version N.

At 220, review events are conducted. Reviews may be in the form of scheduled and impromptu project review events, including synchronized reviews for the release of Version N+1 of DONA, the computing element “machine language definition”. The reviews in a chip development schedule may typically be quarterly and/or based on scheduled completion of specific goals such as FU logic design completion, data structure logic design completion, etc.

At 222, all needed production portions of the compilers and interpreters (FORTRAN, Java, etc.) are ported to produce “Version P” DONA code.

At 224, all tests are run, as well as all other tasks in product delivery.

The term “Object Oriented systems” is discussed herein in a broader view than the definitions given in technical dictionaries and textbooks. According to our view, the first computer system to use the Objects paradigm was the 1961 Burroughs B5000 conceived by Robert S. Barton, built years before the term Object Oriented was coined. The B5000 had a 51 bit word, 48 bits for data and 3 for the rights/properties descriptor.

The Bundle of Rights Paradigm

The Object Oriented model that was developed in the 70's may be considered a variation of the legal real estate “bundle of rights” model.

Only sovereign entities (countries) “own” real estate properties. What the rest of us may own is a “bundle of rights” to a parcel, where others (city, utility) may own easements and other rights to the same parcel (tax, etc.). The bundle of rights may not include, for example, the right to operate a grocery store in your back yard.

In addition to rights, the parcel has natural properties; it may be flat, hilly, or rocky, and may, for example, border a river, which may bestow on it rights to use river water. So “rights” and “properties” are interlinked.

The reason for this broader view is that present Object Oriented systems using “Object, Class, Abstraction, Encapsulation, Inheritance, Polymorphism, Overloading” are designed specifically for the needs of Operating Systems coders. As long as only OS software is involved, classifying Object Oriented as belonging to the “bundle of rights” paradigm is mostly of philosophical interest.

However, we have included in the hardware of the computing device the Mentor/Bins as the mechanism to implement the “bundle of rights” paradigm. We use C++ (or another OS Object software base) for the software porting task. It is clear, however, that the same Mentor/Bins structure may be very useful in the construction of computing systems with different sets of rights/properties than the C++ Object Oriented set, for implementing, for example, franchise operator software, a real estate escrow system, or an airplane design and simulation system where properties of an airframe part are associated with the material (aluminum, carbon-fiber), the natural resonance of the part, the icing resistance of the coating, etc.

Once various data types have been defined for C++, why should the same data types not be used for all other computer languages, for example, for the Fortran used in coding the airframe simulator? The data types and instructions can certainly be used to do the Fortran tasks, but that can also lead to large inefficiencies and frustration in coding programs.

Consider the case in which, among the C++ choices selected for OS coding tasks, all forms of parallelism (macro and micro) are converted to multiple threads for past compatibility reasons, and all shared files are encrypted for OS security reasons.

Those choices, however, will produce extremely inefficient code in Fortran, since the micro parallelism capabilities of the hardware are not used. In addition, for the shared files that are used for communications between the mechanical drawing and the finite element simulator, security is not an issue, but the overhead is daunting. To optimize the code, an aeronautical engineer may have to code his program in C++, a language he may find hard to learn and full of esoterica, useless for his task.

In some multiprocessor systems, the lion's share of the overhead may be associated with the fact that, unless programs share space in a memory, communications among them may take place only through highly “bureaucratic”, protocol laden OS I/O interactions. In a system containing infinite Variables, the OS may set communications rules and associate transmission standards (for encryption, error response, etc.), but once communications links are authorized by the OS, application layer programs in different processors may directly transfer/receive data to/from each other, subject only to the existence and limits of the physical/logical links.

By way of analogy, consider a situation in which the US Post Office is limited to delivering mail in the US. A letter to a friend in Athens is sent to: Friend's address in Athens, c/o: Hellenic Republic embassy, 2217 Massachusetts Ave NW, Washington, D.C. 20008. A clerk in the embassy goes through each letter to make sure the letter passes Greek spelling and grammar checks before forwarding your letter to your friend in Athens. Under the system described herein, in effect, the Greek and US embassies might agree to use the international post office protocol, and both the USA and Greek postal systems just ship each other mailbags. The corresponding embassies are free to tend to diplomatic matters.

Computer systems may, in various embodiments, include components such as a CPU with an associated memory medium such as Compact Disc Read-Only Memory (CD-ROM). The memory medium may store program instructions for computer programs. The program instructions may be executable by the CPU. Computer systems may further include a display device such as a monitor, an alphanumeric input device such as a keyboard, and a directional input device such as a mouse. Computer systems may be operable to execute the computer programs to implement computer-implemented systems and methods. A computer system may allow access to users by way of any browser or operating system.

Computer systems may include a memory medium on which computer programs according to various embodiments may be stored. The term “memory medium” is intended to include an installation medium, e.g., Compact Disc Read Only Memories (CD-ROMs), a computer system memory such as Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Extended Data Out Random Access Memory (EDO RAM), Double Data Rate Random Access Memory (DDR RAM), Rambus Random Access Memory (RAM), etc., or a non-volatile memory such as a magnetic media, e.g., a hard drive or optical storage. The memory medium may also include other types of memory or combinations thereof. In addition, the memory medium may be located in a first computer, which executes the programs, or may be located in a second different computer, which connects to the first computer over a network. In the latter instance, the second computer may provide the program instructions to the first computer for execution. A computer system may take various forms such as a personal computer system, mainframe computer system, workstation, network appliance, Internet appliance, personal digital assistant (“PDA”), television system or other device. In general, the term “computer system” may refer to any device having a processor that executes instructions from a memory medium.

The memory medium may store a software program or programs operable to implement embodiments as described herein. The software program(s) may be implemented in various ways, including, but not limited to, procedure-based techniques, component-based techniques, and/or object-oriented techniques, among others. For example, the software programs may be implemented using ActiveX controls, C++ objects, JavaBeans, Microsoft Foundation Classes (MFC), browser-based applications (e.g., Java applets), traditional programs, or other technologies or methodologies, as desired. A CPU executing code and data from the memory medium may include a means for creating and executing the software program or programs according to the embodiments described herein.

A computing system may include, and/or may be implemented as, multiple functional modules or components, with each module or component including one or more resources (e.g., computing resources, storage resources, database resources, etc.). A system may include more or fewer components or modules, and a given module or component may be subdivided into two or more sub-modules or subcomponents. Also, two or more of the modules or components can be combined.

Further modifications and alternative embodiments of various aspects of the invention may be apparent to those skilled in the art in view of this description. Accordingly, this description is to be construed as illustrative only and is for the purpose of teaching those skilled in the art the general manner of carrying out the invention. It is to be understood that the forms of the invention shown and described herein are to be taken as embodiments. Elements and materials may be substituted for those illustrated and described herein, parts and processes may be reversed, and certain features of the invention may be utilized independently, all as would be apparent to one skilled in the art after having the benefit of this description of the invention. Methods may be implemented manually, in software, in hardware, or a combination thereof. The order of any method may be changed, and various elements may be added, reordered, combined, omitted, modified, etc. Changes may be made in the elements described herein without departing from the spirit and scope of the invention as described in the following claims.

1-56. (canceled)
57. A computing device, comprising: a main memory; a local high speed memory configured to implement a frames/bins structure, wherein the local high speed memory comprises: a plurality of frames, each of at least two of the frames comprising a physical memory element; a plurality of bins distributed in the plurality of frames, wherein each of at least two of the bins comprises a logical element; one or more functional units configured to perform operations relating to one or more variables stored in at least one of the bins, wherein each of at least one of the Variables comprises one or more words; one or more interconnects between the main memory and the local high speed memory; and one or more interconnects between the local high speed memory and the one or more functional units.
58-60. (canceled)
61. The computing device of claim 57, wherein each of at least one of the bins is distributed across two or more of the frames.
62. The computing device of claim 57, further comprising one or more mentor circuits configured to control operations using at least one Variable stored in at least one of the bins.
63-69. (canceled)
70. The computing device of claim 57, further comprising one or more tarmac registers, wherein the tarmac registers are configured for staging of at least one operation.
71. The computing device of claim 57, wherein the processor is configured to perform speculative execution on at least one Variable.
72-78. (canceled)
79. The computing device of claim 57, wherein at least one of the variables corresponds to a physical device used or controlled by the computing device.
80. The computing device of claim 57, wherein at least one of the Variables corresponds to instrumentation or a sensor used or controlled by the computing device.
81-82. (canceled)
83. A method of computing, comprising: providing a plurality of frames in a high speed local memory, wherein the local high speed memory is coupled to a main memory by one or more interconnects and to one or more functional units by one or more interconnects, wherein each of at least two of the frames comprises a physical memory element; and storing a plurality of variables in one or more bins distributed in the plurality of frames, wherein each of at least two of the bins comprises a logical element; and performing, by a processor, one or more operations relating to at least one of the one or more variables stored in at least one of the bins, wherein each of at least one of the variables comprises two or more words.
84-88. (canceled)
89. The method of claim 83, wherein each of at least one of the bins is distributed across two or more of the frames.
90. The method of claim 83, further comprising controlling, by one or more mentor circuits, operations using at least one Variable stored in at least one of the bins.
91-92. (canceled)
93. The method of claim 83, wherein at least one of the mentor circuits is configured to implement a self-bounds check for at least one of Variables.
94. The method of claim 83, wherein at least one of the mentor circuits is configured to implement an intrusion bounds check for at least one of Variables.
95. The method of claim 83, wherein one or more interconnects between the local high speed memory and the one or more functional units comprise: one or more frames/bins interconnect circuits going to the one or more functional units; and one or more frames/bins interconnect circuits coming from the one or more functional units.
96-97. (canceled)
98. The method of claim 83, wherein performing at least one operation comprises staging at least one operation using a tarmac register.
99. The method of claim 83, further comprising performing speculative execution on at least one variable.
100-104. (canceled)
105. The method of claim 83, wherein each of at least one of the variables comprises a set of operands.
106. The method of claim 83, wherein each of at least one of the Variables corresponds to a conceptual unit.
107. The method of claim 83, wherein at least one of the Variables corresponds to a physical device used or controlled by the computing device for interactions with instrumentation.
108. The method of claim 83, wherein at least one of the Variables corresponds to a functional element of a system used or controlled by the computing device for providing communications links.
109. A computing device, comprising: a main memory; a local high speed memory comprising one or more bins, wherein each of at least two of the bins is configured to store a Variable; one or more functional units configured to perform operations relating to one or more Variables stored in the local high speed memory; one or more interconnects between the main memory and the local high speed memory; one or more interconnects between the local high speed memory and the one or more functional units; and one or more mentor circuits configured to control operations relating to at least one Variable stored in at least one of the bins.
110-186. (canceled)